




THE JOURNAL OF 
EDUCATIONAL PSYCHOLOGY 










Volume XXIV 


January, 1933 Number 1


THE INFLUENCE OF IRRELEVANT REWARDS¹


EDWARD L. THORNDIKE 


And the Staff of the Division of Psychology, Institute of Educational Research, 
Teachers College, Columbia University. 



















By any reasonable theory which permits the consequences of a 
mental connection to strengthen the connection at all, we should expect 
relevant consequences to strengthen it more than irrelevant, other 
things being equal. If a hungry animal, trying one after another of 
n alleys in the attempt to escape from the choice pen (C) to the familiar 
food pen (F), finds food in alley four equal in amount to what he finds 
in the food pen by escape through alley six, we should expect the 
strengthening of C → 6 caused by one occurrence of it and its
consequences to be greater than the strengthening of C → 4 caused by one
occurrence of it and its consequences. A relevant reward is more
likely to “belong” to the connection; for it belongs more to the set or
adjustment in which the connection is. It is more likely to satisfy 
the animal, or likely to satisfy the animal more. 

By extreme forms of purposivism and by certain other doctrines a 
truly irrelevant reward can have no strengthening influence whatever. 
Its incongruity with the animal’s purposes bars it from any dynamic 
attachment to the animal’s course of behavior. By such doctrines, it is 
not enough for a consequence of a connection to be satisfying; it must 
satisfy the particular purpose or craving which is then and there caus- 
ing the animal’s behavior. It is not enough for an after-effect to belong 
to a connection casually as its felt sequent; it must make with it some 
comprehensible and “real” unit.

We have devised and carried out an experiment to measure the 
influence of various sorts of irrelevancy in the satisfying after-effects of 
connections. 





1 The investigation reported here was a part of a general investigation of interest 
and motive supported by a grant from the Carnegie Corporation. 
A series of cards numbered from one hundred thirty-two to four
hundred fifty-eight was prepared consisting of: 


(A) Fifty-four cards containing well written and interesting quotations. 

(B) Seventy-three cards containing less interesting quotations. 

(D) Eighty cards containing extremely uninteresting statements. 

(J) Twenty-one cards containing jokes. 

(P) Ten cards containing amusing pictures (mostly from Punch). 

(M) Ten cards containing the statement “Hand this to the examiner and you
will receive three cents” (sometimes five cents).

(N) In the case of seventy-nine of the numbers there were no cards. 
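The composition of the series can be checked with a few lines of arithmetic. The sketch below only re-adds the counts listed above and compares them with the stated number range; the variable names are illustrative and not part of the original report.

```python
# Counts of each kind of card, as listed in the text.
cards = {"A": 54, "B": 73, "D": 80, "J": 21, "P": 10, "M": 10, "N": 79}

# The series is said to run from 132 to 458 inclusive.
first_number, last_number = 132, 458
numbers_in_series = last_number - first_number + 1

total_listed = sum(cards.values())
print(total_listed, numbers_in_series)  # both come to 327

# Share of the series that each kind of after-effect occupies, in per cent.
shares = {kind: 100 * n / total_listed for kind, n in cards.items()}
for kind, share in shares.items():
    print(kind, round(share, 1))
```

The total of three hundred twenty-seven agrees with the number of responses named in the caption of Table IV.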


The subjects at the start knew nothing about the constitution of the 
series. They were instructed as follows: 


Here are several hundred cards numbered from one hundred thirty-two to
four hundred fifty-seven. Your first task is to choose cards by number until you
get one that says something about death. You will say to the experimenter three
times “I want death. I choose two hundred ninety-six” (or whatever number
you select). The examiner will give you that card. You will read what is on it.
If it says anything about death, you will hand it to the examiner. He will, if
you are right, put it in the box marked “Right.” If it does not say anything about
death, you will put it face up in the box marked “Wrong,” and choose again, saying
as before, three times “I want death. I choose. . . .” The examiner will give
you that card. You will read what is on it. If it says anything about death, you
will give it to the examiner as before. If it does not, you will put it in the box of
wrongs as before.

You will do just the same with each of the forty-four tasks. Then you will 
repeat them. Then you will repeat them again, and again, and again as often 
as time permits. 

You will be scored and paid according to the number of tasks you complete 
successfully. So the quicker you get the right card for each task the higher score 
you will get and the more you will earn. 

If you are to participate in this experiment you must pledge your word of 
honor as follows: 

1. I will not give any information about the experiment to anybody until 
April 1. 

2. I will not receive any information from anybody about the experiment 
until April 1. 

3. I will not make any effort to remember or recall the number of any card 
after I have announced my choice. I will simply read it and dispose of it as 
directed. 

4. I will not look on the back of any card unless told to do so by the examiner.

Besides these procedures to which you have pledged yourself, it will be convenient
and probably help your score considerably if you are careful to be sure that a card
is Right before you give it to the experimenter.
The forty-four tasks were, in order:

Series 4.—Continued

8. Angels, pneumonia or diphtheria.


These four series were repeated twice or more with each of nineteen 
young women (college students). 

A card drawn from the file was put back into it after an interval
of fifteen to twenty further choices. If a joke or picture was drawn, a
new joke or picture with the same number was put into the file after
such an interval. If a number was chosen for which there was no card,
the experimenter said, “There is no card with that number.”

The subject thus has, as the after-effect of choosing a number, some 
one of these: 

Ac Reading a well written and interesting quotation which contains
some word or other expression of an idea which he wants,

Ax Reading a well written and interesting quotation which does not
contain some word or other expression of an idea which he wants,

Bc Reading a less well written and less interesting quotation which
contains the word or idea which he wants,

Bx Reading a less well written and less interesting quotation which
does not contain the word or idea which he wants,

Dc Reading a very dull statement which contains some word or
other expression of an idea which he wants,

Dx Reading a very dull statement which does not contain the word
or idea which he wants,

Jx Reading a joke (and never the same one again) which does not
contain the word or idea he wants. In a very few cases, perhaps one in
a thousand, the joke did contain it, but these exceptions are disregarded
in what follows,

Px Seeing a funny picture and reading its caption, which does not
contain the word or idea that he wants,

M Receiving money (three cents or five cents) then and there,

N Being told “There is no card with that number.”

Under the conditions of the game, Ac, Bc, and Dc are alike in the
relevant reward of finding the wanted word or expression and unlike in
the irrelevant reward of the literary value and interest of the content
read. Ax, Bx, and Dx are alike in the relevant punishment (or lack of
reward) of not finding the wanted word or expression, and unlike in the
irrelevant reward of the literary value and interest of the content read.
The extent to which the literary value and interest are rewards depends,
of course, upon the extent to which the subject is satisfied by literary
quality and interest.

Jx and Px are like Ax, Bx and Dx in the absence of any satisfaction
by the presence of the idea or word then wanted, but unlike them in
that they contain (with the exceptions noted above) no word or idea
that ever can be wanted in the experiment. They are like Ac in the
presence of a high degree of irrelevant interest.

M has, the first time that it occurs, an irrelevant satisfyingness,
the amount of which varies with the subject’s valuation of money.
After M has occurred enough times to arouse the expectation that
choices of numbers may result in money directly as well as by way of
the score attained, M may become a relevant reward, the subject being
desirous to get cards with the wanted words or to get money directly.

N is unlike all the above save Dx in producing no satisfaction whatever
from reading the card, and unlike all, including Dx, in lacking even
the effect of reading the card.

We may study the influence of Ac, Ax, Bc, Bx, Dc, Dx, Jx, Px, M
and N upon the connections with the various situations such as “I
want heaven or angels,” “I want death,” “I want roses,” etc. We
may also study the influence of Ac, Ax, Bc, Bx, etc., upon the connections
made with the general situation of wanting an expression of this,
that, or the other of these ideas. We do the latter first.


THE INFLUENCE OF RELEVANT AND IRRELEVANT REWARDS UPON CON- 
NECTIONS WITH THE GENERAL SITUATION 


We ask “What is the probability that a number, the choice of which
results in Ac, will ever be chosen again by the subject in question?”
We ask the same question about numbers the choice of which results in
Ax, Bc, Bx, etc.

The first fact to note is that the influence of the relevant reward 
of finding the wanted idea or word upon the tendency to choose that 
card again in general rather than to choose some other card is small. 

For example, there were one thousand sixty-four cases where B 
cards were chosen one or more times. On the occasion of the first 
choice of these B cards the wanted idea or word was there in one hun- 
dred forty-five cases of the one thousand sixty-four. There were then 
one hundred forty-five Bc and nine hundred nineteen Bx. Of the
one hundred forty-five Bc, 77.9 per cent were chosen again at some time 
during the four hours of the experiment. Of the nine hundred nineteen
Bx cards, 76.6 per cent were chosen again at some time during the four
hours of the experiment. The difference is 1.3.

For the influence of Ac versus Ax and Dc versus Dx in first choices
we have differences of 3.1 and -1.2. The numbers and percentages
were: There were seven hundred ninety-six A; one hundred six were Ac,
of which 79.2 per cent were chosen again; six hundred ninety were Ax,
of which 76.1 per cent were chosen again. There were one thousand
one hundred seventy-four D; two hundred sixteen were Dc, of which
73.6 per cent were chosen again; nine hundred fifty-eight were Dx, of
which 74.8 per cent were chosen again.

The totals for A, B and D together give 76.2 per cent chosen again
for the c’s and 75.8 per cent chosen again for the x’s, a difference of only
0.4. These facts and those for two, three, and four choices appear in
Table I.
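The pooled figures for first choices can be recomputed from the counts and percentages quoted in the text. The sketch below assumes that the totals are case-weighted means of the A, B and D percentages; the text does not state the method, but this assumption reproduces the reported values. The names are illustrative.

```python
# (number of cases, per cent chosen again) for first choices, from the text.
first_choice = {
    "Ac": (106, 79.2), "Ax": (690, 76.1),
    "Bc": (145, 77.9), "Bx": (919, 76.6),
    "Dc": (216, 73.6), "Dx": (958, 74.8),
}

def pooled_percentage(labels):
    """Case-weighted mean of the recurrence percentages for the given labels."""
    total_cases = sum(first_choice[k][0] for k in labels)
    weighted_sum = sum(first_choice[k][0] * first_choice[k][1] for k in labels)
    return weighted_sum / total_cases

pooled_c = pooled_percentage(["Ac", "Bc", "Dc"])
pooled_x = pooled_percentage(["Ax", "Bx", "Dx"])

print(round(pooled_c, 1))             # 76.2
print(round(pooled_x, 1))             # 75.8
print(round(pooled_c - pooled_x, 1))  # 0.4
```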


TABLE I.—THE FREQUENCY OF RECURRENCE OF RESPONSES ACCORDING TO THEIR
AFTER-EFFECTS: A, B AND D

For the influence of two c’s, a c and an x and two x’s in the case of
cards chosen twice, we have the following (we combine A, B and D
cards to get rid of chance fluctuations):

Of cards, the two choices of which resulted in two c’s, 72.1 per cent
were chosen again at some time during the experiment. Of cards, the
two choices of which resulted in one c and one x, 72.5 per cent were
chosen again. Of cards, the two choices of which resulted in two x’s,
72.6 per cent were chosen again. The numbers involved in these
percentages were respectively eighty-six, five hundred thirty-four, and
one thousand six hundred seventy.

Similar percentages in the case of cards chosen at least three times 
are as follows: 


                                          Per Cent
Producing three c’s ..................... 90.0 (N = 20)
Producing two c’s and one x ............. 65.6 (N = 125)
Producing one c and two x’s ............. 73.1 (N = 475)
Producing three x’s ..................... 74.7 (N = 1046)


Combining the first two groups, the percentage for one hundred forty-five
occurrences, twenty of which were ccc and one hundred twenty-five
of which were ccx or cxc or xcc, was 69.0.

Similar percentages in the case of cards chosen at least four times
are as follows:


                                          Per Cent
Producing four c’s ...................... 75.0 (N = 4)
Producing three c’s and one x ........... 71.4 (N = 42)
Producing two c’s and two x’s ........... 64.2 (N = 120)
Producing one c and three x’s ........... 72.7 (N = 363)
Producing four x’s ...................... 70.9 (N = 698)


Obviously the effect of the relevant reward upon choices in general is 
not great. As will be shown later, the probabilities that a card which 
has been chosen once, twice, thrice, and four times will be chosen again 
are respectively about .738, .736, .733 and .728 if the choice has been 
neither notably rewarded nor notably punished. The excesses above 
these for the c, cc, ccc, ccx, cccc and cccx entries of Table I number only
nine out of sixteen, with a median at 2.0. The xz, rz, rxz, crx, xxx and 
cxxzx entries of Table I include ten excesses with a median at 0.3. We 
must then expect that irrelevant satisfying after-effects of choices will, 
even if potent, show small and irregular effects upon choices of the 
cards in question in general. 
We may measure the influence of the irrelevant reward of greater
interest and literary value by the greater frequency of repeated choices
of Ac over Bc, Bc over Dc, Ax over Bx, Bx over Dx. We have thirteen
comparisons of each sort available in Table I, there being no cases of Ac
or Bc occurring four times in the first four choices. The unweighted
average differences are 2.1 for A to B and 3.6 for B to D. The average
differences weighted according to the size of the smaller number of cases
used in computing the difference (by the scheme shown below) are 0.9
for A to B and 2.0 for B to D.














Size of smaller    Weight        Size of smaller    Weight
population                       population
1 or 2               1           210 to 239           15
3 to 6               2           240 to 279           16
7 to 12              3           280 to 309           17
13 to 20             4           310 to 349           18
21 to 30             5           350 to 389           19
31 to 44             6           390 to 429           20
45 to 59             7           430 to 469           21
60 to 69             8           470 to 509           22
70 to 89             9           510 to 549           23
90 to 109           10           550 to 599           24
110 to 139          11           600 to 649           25
140 to 159          12           650 to 699           26
160 to 189          13           700 to 749           27
190 to 209          14           750 to 799           28
                                 800 to 869           29
                                 870 to 929           30
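The weighting scheme can be written as a small lookup. The sketch below is an illustration, not the authors' procedure as recorded: the function names are hypothetical, and it assumes the weight for the class 70 to 89 is 9, its position in the sequence (the scanned table is unclear at that entry). The two comparisons used in the example are the first-choice A to B figures quoted earlier in the text; they are only two of the thirteen comparisons, so the result is not expected to equal the reported 0.9.

```python
from bisect import bisect_right

# Lower bound of each size class in the table. The weight of a class is its
# 1-based position: "1 or 2" -> 1, "3 to 6" -> 2, ..., "870 to 929" -> 30.
# Assumption: the class "70 to 89" carries weight 9.
LOWER_BOUNDS = [1, 3, 7, 13, 21, 31, 45, 60, 70, 90, 110, 140, 160, 190,
                210, 240, 280, 310, 350, 390, 430, 470, 510, 550, 600, 650,
                700, 750, 800, 870]
MAX_SIZE = 929

def weight_for(smaller_population):
    """Weight assigned to a comparison whose smaller group has this many cases."""
    if not 1 <= smaller_population <= MAX_SIZE:
        raise ValueError("size lies outside the tabled range")
    return bisect_right(LOWER_BOUNDS, smaller_population)

def weighted_mean_difference(comparisons):
    """comparisons: iterable of (difference, cases_in_first, cases_in_second)."""
    weighted = [(diff, weight_for(min(n1, n2))) for diff, n1, n2 in comparisons]
    return sum(d * w for d, w in weighted) / sum(w for _, w in weighted)

# First-choice A to B comparisons from the text: Ac vs Bc and Ax vs Bx.
example = [
    (79.2 - 77.9, 106, 145),  # smaller group 106 -> weight 10
    (76.1 - 76.6, 690, 919),  # smaller group 690 -> weight 26
]
example_result = weighted_mean_difference(example)
print(weight_for(106), weight_for(690), round(example_result, 2))
```

The example shows how a large negative-leaning comparison can offset a small positive one once weights are applied, which is why the weighted averages in the text are smaller than the unweighted ones.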














Consider now the irrelevant satisfactions from the jokes, pictures, 
and money. The facts appear in Table II, together with those for the 
No cards. 

We compare the excesses (or deficits) over the general probabilities 
with those for the relevant rewards on the one hand and with those for 
the No cards which were as futile for the game as the J, P and M cards, 
and also lacked any irrelevant reward. The approximate general 
probabilities of .738, .736, .733, and .728 for the reoccurrence of a 
choice after one, two, three, and four choices were determined as follows:
The actual percentages of repetitions for Ax, Bx, Dx, J and P
were averaged, giving 73.8; the actual percentages of repetition for
Ax Ax, Bx Bx, Dx Dx, JJ and PP were averaged, giving 73.6; and so on.
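The first of these general probabilities can be reproduced from figures given elsewhere in the report. The sketch below assumes a simple unweighted mean, as the wording suggests, and takes the Ax, Bx and Dx percentages from the text and the J and P percentages from Table II. The helper name `excess_over_general` is illustrative.

```python
# Per cent chosen again after one choice, by after-effect.
after_one_choice = {"Ax": 76.1, "Bx": 76.6, "Dx": 74.8, "J": 69.4, "P": 72.3}

# Unweighted mean of the five percentages.
general_after_one = sum(after_one_choice.values()) / len(after_one_choice)
print(round(general_after_one, 1))  # 73.8

def excess_over_general(percentage, general=73.8):
    """Excess (or deficit, if negative) of a recurrence percentage over the general value."""
    return percentage - general

# Example: first choices resulting in money (M) recurred in 81.2 per cent of cases.
print(round(excess_over_general(81.2), 1))  # 7.4
```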
The relevant rewards Dc, Dc Dc, Dc Dc Dc, Dc Dc Dc Dc, showed
respectively excesses of -0.2, 2.4, 12.4 and 2.2. Dc Dc Dx and Dc
Dc Dc Dx showed deficits of -3.3 and -7.4. In terms of the influence
of one rewarded occurrence, we have -0.2, 1.2, 4.1, .55, -3.3 and
-3.7. If we weight these respectively as by the number of occurrences
(two hundred sixteen, two times fifty, three times fourteen, four times
four, seventy and two times twenty-six),¹ we obtain 0.3 as the weighted
average excess. By similar procedures applied to the excesses of M,
J and P, we obtain 1.3 as the weighted average for one occurrence of
M, -0.3 for one of J, and -9.8 for one of P. Choices followed by
these irrelevant rewards are not repeated as often as choices followed
by relevant rewards.

In comparison with the choices resulting in “There is no card with
that number,” M, J and P have strong positive influence. In eleven of
the twelve cases the excess is positive. On the same per-occurrence
basis as was used above, M to N = 3.6, J to N = 2.1, and P to N = 0.9.

There can be little doubt that the money reward is effective. What
one should conclude concerning the jokes and pictures depends upon
the following facts: The subjects could, and probably did, very
soon come to realize that numbers evoking jokes or funny pictures
were of no more value in the game than numbers evoking “There is no
card with that number.” The reading matter upon them was printed
or typed, whereas that on the A, B and D cards was always in handwriting.
Consequently, a subject could be aware of the futility of his
choice as soon as he saw the joke or picture card, even without reading
it. The excesses of 2.1 and 0.9 over the percentage of repetitions for an
“N” may then all be attributed to the irrelevant satisfyingness of the
joke or picture, on the ground that the total effect was as if the examiner
had said, “There is no card of that number, but you must now read
this joke or look at this picture.” On the other hand, it may be argued
that the proper comparison is with the Ax, Bx and Dx cards, since the
J and P cards were read like them and, like them, produced no good
result for the score.

In my judgment the first of these views is much nearer the truth. 
If we accept this opinion provisionally until better planned experi- 
ments are made, we have 1.5 as the average difference between an 
entirely futile choice and a choice futile for the game but productive of 
an interesting joke or picture. 





1 We count Dc Dc Dx as equal to one Dc, and Dc Dc Dc Dx as equal to two Dc’s.
We may check this conclusion by observing what happens in the
nineteen subjects’ first choices of M when the irrelevance was complete. 
Sixteen of them were repeated. This percentage (84.2) gives a higher 
excess over general probability than was found for M choices in general. 


TABLE II.—THE FREQUENCY OF RECURRENCE OF RESPONSES ACCORDING TO
THEIR AFTER-EFFECTS: M, J, P AND N

(Percentage recurring a second time or more after one choice, a third time
or more after two choices, a fourth time or more after three choices, and a
fifth time or more after four choices.)

After-effect      N             Percentage recurring
M                 154           81.2
J                 144           69.4
P                 318           72.3
N                 1027          70.2
MM                126           72.2
JJ                78            78.8
PP                231           72.3
NN                799           68.7
MMM               91            79.1
JJJ               79            69.6
PPP               167           68.3
NNN               541           68.2
MMMM              73            69.9
JJJJ              [illegible]   78.2
PPPP              [illegible]   73.9
NNNN              366           63.9

After-effect      Excess over general probability      Excess over N, NN, NNN or NNNN
M                  7.4 (over 73.8)                     11.0
J                 -4.4                                 -0.8
P                 -1.5                                  2.1
MM                -1.4 (over 73.6)                      3.5
JJ                 5.2                                 10.1
PP                -1.3                                  3.6
MMM                5.8 (over 73.3)                     10.9
JJJ               -3.7                                  1.4
PPP               -5.0                                  0.1
MMMM              -2.9 (over 72.8)                      6.0
JJJJ               5.4                                 14.3
PPPP               1.1                                 10.0








As a further check we may observe the influence of two, three and
four choices of the same M, J and P cards. As the experiment progresses,
the subjects presumably increasingly adopt a double aim—to get “Right”
cards and to get “Money” cards; and getting an M card becomes a more
and more relevant satisfier. The subject becomes more and more fully
aware that J and P cards are barren of “Rights”, so that receiving one
of them becomes as unsatisfactory as an N, so far as the game goes.¹





1 Possibly more so, since he is compelled to spend time in reading the jokes,
which he could otherwise use in making more choices.
The irregularity in the differences of Table III shows that larger
numbers of subjects are needed. So far as our data go, the M to N dif- 
ferences actually decrease with the increasing relevance of M satisfiers, 
and the J to N and P to N differences decrease little, if any, with the 
increasing irrelevance of J and P satisfiers. 

On a per-occurrence basis we have M to N equalling 11.0, 1.8, 3.6
and 1.5 from the M, MM, MMM, and MMMM comparisons; we have
the average of J to N and P to N as 0.65, 3.43, 0.25 and 3.04 from the
J, JJ, JJJ, etc., comparisons.
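These per-occurrence figures follow from the recurrence percentages of Table II. The sketch below assumes that the per-occurrence value is the excess over the corresponding N percentage divided by the number of choices, which is the reading that reproduces the reported numbers to within rounding. The names are illustrative.

```python
# Percentages recurring after one, two, three and four choices (Table II).
recurrence = {
    "M": [81.2, 72.2, 79.1, 69.9],
    "J": [69.4, 78.8, 69.6, 78.2],
    "P": [72.3, 72.3, 68.3, 73.9],
    "N": [70.2, 68.7, 68.2, 63.9],
}

def per_occurrence_over_n(label):
    """Excess over N, NN, NNN, NNNN divided by the number of choices (1 to 4)."""
    pairs = zip(recurrence[label], recurrence["N"])
    return [(pct - n_pct) / k for k, (pct, n_pct) in enumerate(pairs, start=1)]

m_to_n = per_occurrence_over_n("M")
jp_to_n = [(j + p) / 2 for j, p in zip(per_occurrence_over_n("J"),
                                       per_occurrence_over_n("P"))]

print([round(v, 2) for v in m_to_n])   # close to the reported 11.0, 1.8, 3.6, 1.5
print([round(v, 2) for v in jp_to_n])  # close to the reported 0.65, 3.43, 0.25, 3.04
```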


THE INFLUENCE OF RELEVANT AND IRRELEVANT REWARDS UPON 
CONNECTIONS WITH PARTICULAR SITUATIONS 


We now inquire concerning the influence of relevant and irrelevant 
rewards upon the tendency to repeat particular rewarded connections. 
If for example “I want roses—I choose four hundred sixty-one” is
followed by a relevant reward whereas “I want roses—I choose four
hundred sixty-two” is followed by an irrelevant reward, what are the
comparative strengthenings of the two connections? Is there any 
more strengthening in the latter case than would happen with no reward 
at all? 

A demonstrated equalization of all other factors than the relevance 
or irrelevance of the reward would require an enormous amount of 
clerical and statistical work.¹ It is better to make certain justifiable
assumptions and use the time thus saved for other experiments. 

What we have done is to record in temporal order the responses 
made to each situation by each subject, and to re-group these according 
to the response made (i.e., the number chosen) using only the three 
simple situations (“I want death,” “I want roses” and “I want love”),
and only the first eight trials or rounds of the series. 

We then sum the occurrences for each response to give the facts for 
all nineteen subjects. 

Table IV shows a sample of the results. 

For example, the number one hundred thirty-three was chosen in 
response to “I want death,” “I want roses” and “I want love” thirteen
times in trial 1, zero times in trial 2, three times in trial 3, and so on as 
shown in Table IV. The after-effect of choosing it was the possible 
satisfaction of reading a well-written and estimable quotation plus the 
annoyance of finding no mention of the desired idea. The number 134 





1 Many thousands of hours. 
was chosen in response to “I want death” and “I want roses” eight,
three, four, four, etc., times in successive trials of the series as shown in
Table IV. The after-effect of choosing it in these situations was the
possible satisfaction as above, plus also the satisfaction of finding
mention of the desired idea and receiving credit therefor. The number
one hundred thirty-four was chosen in response to “I want love” one,
one, two, three, etc. times in successive trials with after-effects the
same as for the choice of one hundred thirty-three. The choices of
one hundred thirty-five in response to the three situations numbered
seventeen, one, three, three, etc. with always the “punishment” of
hearing “There is no card with that number.”


TABLE IV.—THE FREQUENCY OF OCCURRENCE, IN ROUNDS ONE TO EIGHT, OF
EACH OF THE THREE HUNDRED TWENTY-SEVEN RESPONSES TO EACH OF THREE

The nine hundred seventy-eight records like those of Table IV are 
then combined according to the sort of after-effect. The result is 
Table V. The right hand side of Table V duplicates the left hand side, 
except that the frequencies at each trial or round are expressed as
percentages of the total frequency in round 1. 


TABLE V.—THE FREQUENCIES OF THE SORT SHOWN IN TABLE IV, GROUPED AND
SUMMED ACCORDING TO THE NATURE OF THE AFTER-EFFECT
There is a general decrease in the number of responses made, round
after round, to the three situations, because the rewarded responses are
strengthened relatively to the others and so are given earlier. The
total number drops from eight hundred six in trial 1 to three hundred
forty-eight in trial 7 and three hundred ninety-eight in trial 8, the
percentages on the number in trial 1 being one hundred, sixty-three,
forty-four, fifty-seven, sixty-one, fifty-one, forty-three and forty-nine.
The percentages for the relevant rewards Ac + Bc + Cc are much
above this general average. Those for the irrelevant rewards
(J + P + M) and Ax are somewhat above it. Those for the emphatic
and immediate “There is no card with that number” are much below
it. Comparisons between the different sorts of after-effects will
be clearer from Table VI in which each frequency of Table V is
expressed as a per cent of the total frequency of responses to the three
situations in that round. Table VI shows the relative strengths in a
composite person representing the general tendency of the nineteen
subjects.


TABLE VI.—THE RELATIVE STRENGTH OF CONNECTIONS ROUND BY ROUND
ACCORDING TO THE SORT OF AFTER-EFFECT ATTACHED TO THEM: THE
PERCENTAGE WHICH THE FREQUENCIES OF EACH SORT ARE OF
THE TOTAL FOR THE ROUND IN QUESTION

Choices resulting in the relevant reward of finding the desired idea
and receiving credit therefor (Ac + Bc + Cc) gain greatly (7.3 per
cent in trial 1 to 15.5 in trial 7 and 13.3 in trial 8). Among the choices
punished by failure to find the desired idea but rewarded irrelevantly
by high literary quality (Ax), a joke (Jx) or a picture (Px), or money
(M), the gain is from 23.0 per cent in trial 1 to 24.7 in trial 7 and 30.1
in trial 8. The Bx and Cx choices were 38.0 per cent in trial 1, 39.3 in
trial 7 and 40.2 in trial 8.

Choices resulting in “There is no card of that number” fell off
greatly (from 31.8 per cent in trial 1 to 20.4 in trial 7 and 16.3 in trial
8). The gains noted of Ac, Bc, Cc, Ax, Px, and Jx, and the slight gains
for Bx and Cx were almost wholly at the expense of N’s. The M’s
alone lose slightly but the determination is from too small numbers to
be reliable.

The connections which have the relevant rewards (Ac + Bc + Cc)
as after-effects are strengthened most. From round 1 to 7 and 8
the gain is 7.1 or 97 per cent (7.3 to 14.4). For the irrelevant rewards
(J, P, M, and Ax), the corresponding gain is 4.4 or 19 per cent (23.0
to 27.4). For the connections resulting in a card that has some
promise, which, however, is unfulfilled (Bx + Cx), the corresponding
gain is 1.7 or 4½ per cent (38.0 to 39.7).
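The gains quoted above can be retraced from the round 1, round 7 and round 8 shares given in the text. The sketch below assumes that the later value is the mean of rounds 7 and 8, which matches the bracketed figures (for example 7.3 to 14.4). The group labels and the function name are illustrative.

```python
# Share of all responses in rounds 1, 7 and 8 (per cent), from the text.
shares = {
    "relevant (Ac + Bc + Cc)": (7.3, 15.5, 13.3),
    "irrelevant (J, P, M, Ax)": (23.0, 24.7, 30.1),
    "unfulfilled (Bx + Cx)": (38.0, 39.3, 40.2),
}

def gain(round_1, round_7, round_8):
    """Absolute gain and gain as a per cent of the round 1 share."""
    late = (round_7 + round_8) / 2
    absolute = late - round_1
    return absolute, 100 * absolute / round_1

gains = {name: gain(*values) for name, values in shares.items()}
for name, (absolute, relative) in gains.items():
    print(name, round(absolute, 2), round(relative, 1))
```

Under this reading the relative gain of the relevant rewards is about five times that of the irrelevant rewards, which in turn is about four times that of the unfulfilled cards.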

Both lines of evidence favor the decision that a satisfying after-effect
strengthens somewhat the connection to which it is attached,
even when it is irrelevant to the purpose in the interest of which the
connection was made and highly incongruous with the cravings and
PROBLEM


There are at least two considerations which give inferential support 
to the hypothesis that performance in learning is a function of the 
personality traits of the learner. The first of these considerations is 
the observation, made repeatedly by the writers and doubtless by many 
other experimenters, that speed of learning and methods of attack 
seem to be correlated with the subjects’ personality traits. Some 
subjects, for example, take an aggressive, ascendant attitude toward 
the learning task, while others are submissive and appear dominated 
and subdued by the situation. Some use a careful and methodical 
method of attack, others an apparently impulsive and hit-or-miss 
procedure. Similarly some subjects are obviously extroverted with 
respect to the learning problem, others are introverted; some are emo- 
tionally stable and others emotionally unstable; some appear to be 
well integrated personalities, while others are clearly characterized by 
imbalance and partial dissociation. So closely do these traits seem 
to be connected with learning performance that we have frequently 
attempted prediction of performance from the subject’s behavior 
(not his score) on the first one or two trials, and these predictions have 
had a considerable accuracy, although they have lacked adequate 
quantitative treatment. It would be futile to speculate upon the 
relation of specific personality traits to learning performance, but the 
observations mentioned have led us to the tentative hypothesis that 
at least some traits of personality are important conditioning factors 
in the intricate context of conditions upon which learning performance 
depends. 

The second consideration is a priori in character. It is generally 
believed that certain desirable traits of personality and character must 
be present to make intelligence effective. To the extent that learning 
is an intelligent activity it may be expected that these traits will be
important as factors affecting learning. 

An extended set of further assumptions underlies both of these 
considerations, but since the hypothesis is tentative and waits upon 
quantitative evidence, further analysis will be delayed. It is the 
purpose of this paper to report an initial and preliminary attempt to 
test the hypothesis with some data already at hand. 


EXPERIMENT I 


Method.—To a group of fifty-seven college men were given, on 
different days and according to the standard procedures, the Allport 
A-S Tests, the Conklin Scale of Introvert-Extrovert Interest Differ- 
ences, the Neyman-Kohlstedt Introvert-Extrovert Test, and Army 
Alpha. The same subjects were asked to study six four-line stanzas 
of a poem (Alice Brand) presented in mimeographed form to each one 
for fifteen minutes. An immediate written recall was taken and a 
delayed recall after forty-eight hours. They also learned on successive 
days, in the same order and at the rate of one list per day, two lists of 
monosyllabic nouns and two of nonsense syllables. The lists of nouns 
contained fifteen items and those of syllables ten items. A two- 
minute study time was allowed in each case, and immediate written 
recalls were taken. The recalls of the poetry were scored in terms of 
the number of lines correct, regardless of position, and the syllables 
and words were scored in terms of number correct, regardless of 
position. 

Results.—The coefficients of correlation between test scores and 
learning data appear in Table I. In the scatter diagrams the scores 
plotted against the learning records have been entered so that, pro- 
ceeding upward along the y-axis, the meanings are: Submission to 
ascendance, introversion to extroversion. The relationships shown 
in the table are uniformly low and the probable errors are of the order 
of .09. There are, however, certain uniformities. The A-S scores 
correlate negatively with five out of six of the learning records, which 
indicates a slight but consistent tendency for high learning performance 
to be associated with submissiveness as measured by this test. Scores 
on the Conklin test, likewise, have a consistent but low negative corre- 
lation with learning scores, indicating a tendency for introversion, as 
measured by this test, to be associated with better performance in 
learning. There is no consistent tendency in the correlations between 
Neyman-Kohlstedt and learning scores. Four of the six coefficients 

TABLE I.—COEFFICIENTS OF CORRELATION BETWEEN PERSONALITY TEST SCORES 
AND LEARNING DATA¹ 

                                              Personality test 
Learning record                         A-S    Conklin   Neyman-Kohlstedt 
Poetry: immediate recall               0.03     -0.15          0.20 
Poetry: delayed recall (percentage)   -0.01     -0.01          0.14 
Nonsense syllable List 1              -0.11     -0.06         -0.16 
Nonsense syllable List 2              -0.13     -0.11          0.20 
Monosyllable List 1                   -0.15     -0.04          0.12 
Monosyllable List 2                   -0.14     -0.11         -0.12 

1 Tables of the means and sigmas of the correlated arrays would unnecessarily 
consume space. The time limits used in learning gave a relatively normal dis- 
tribution of scores and a good range of performance. The personality scores 
range from 50-155 (Conklin), from —44 to +46 (A-S), and from —24 to +26 
(N-K), with roughly normal distributions. 


are positive, but none is high. If anything, there is a tendency for 
extroversion, as measured by this test, to be associated with better 
learning. The Conklin and Neyman-Kohlstedt tests have an inter- 
correlation of 0.04. The A-S test correlates with these two 0.08 and 
0.24, respectively. The correlations between Army Alpha and the 
tests of personality traits are virtually zero, so that we may assume 
that intelligence is hardly an important factor in the relationships 
discussed above. 
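
The probable errors quoted above can be checked against the classical formula for the probable error of a Pearson coefficient, PE = 0.6745(1 - r²)/√N. The following sketch is an illustration added here, not a computation given by the authors. It assumes N = 57 (the group described under Method), and the function name is hypothetical.

```python
import math


def probable_error_of_r(r: float, n: int) -> float:
    """Classical probable error of a Pearson r: 0.6745 * (1 - r^2) / sqrt(n)."""
    if n <= 0:
        raise ValueError("n must be positive")
    return 0.6745 * (1.0 - r * r) / math.sqrt(n)


# Assumption: all fifty-seven subjects of Experiment I enter every coefficient.
N_SUBJECTS = 57

if __name__ == "__main__":
    # The coefficients in Table I range in absolute size from 0.01 to 0.20.
    for r in (0.01, 0.10, 0.20):
        pe = probable_error_of_r(r, N_SUBJECTS)
        print(f"r = {r:.2f}   PE = {pe:.3f}   r / PE = {r / pe:.2f}")
```

For coefficients of the size found in Table I the formula gives values close to .09, and none of the coefficients reaches two and a half times its probable error, which agrees with the description of the relationships as uniformly low.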

In measurements of this kind chance variations are likely to be 
great. While correction for attenuation, when the raw Pearson r’s are 
as low as those reported here, is of doubtful significance, we shall 
report the results of such correction in the case of the A-S and Conklin 
tests. The reported reliability of the former is 0.737¹ and of the latter 
0.92.² The immediate recalls of the monosyllable lists intercorrelate 
0.21; those of the syllable lists intercorrelate 0.23. Using these as 
reliability coefficients and correcting for attenuation, the correlations 
between A-S scores and the two monosyllable lists become -0.38 and 
-0.35, and those between the same test and the syllable lists become 
-0.27 and -0.32. Likewise the relationships between Conklin 


1 Allport, G. W.: A Test for Ascendance-submission. Journal of Abnormal and 
Social Psychology, Vol. XXIII, 1928, pp. 118-136. 

2 Conklin, E. S.: The Determination of Normal Extravert-introvert Interest 
Differences. Pedagogical Seminary and Journal of Genetic Psychology, Vol. 
XXXIV, 1927, pp. 28-37. 
scores and the immediate recalls mentioned above become -0.09, 
-0.25, -0.13, and -0.24, respectively. Reliability coefficients for 
the poetry recalls and for the Neyman-Kohlstedt test are not available. 
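
The correction for attenuation used above is Spearman's formula, in which the raw coefficient is divided by the square root of the product of the two reliabilities. The sketch below recomputes the corrected values from the raw coefficients of Table I and the reliabilities quoted in the text. It is offered only as a check on the arithmetic. The names are hypothetical, and it assumes that the third and fourth rows of Table I are the two nonsense-syllable lists.

```python
import math


def correct_for_attenuation(r_xy: float, rel_x: float, rel_y: float) -> float:
    """Spearman's correction: r_xy / sqrt(rel_x * rel_y)."""
    if rel_x <= 0 or rel_y <= 0:
        raise ValueError("reliabilities must be positive")
    return r_xy / math.sqrt(rel_x * rel_y)


# Reliabilities quoted in the text.
TEST_RELIABILITY = {"A-S": 0.737, "Conklin": 0.92}
LIST_RELIABILITY = {"monosyllable": 0.21, "syllable": 0.23}

# Raw coefficients from Table I for List 1 and List 2 of each material
# (assumption: the third and fourth table rows are the syllable lists).
RAW = {
    ("A-S", "monosyllable"): (-0.15, -0.14),
    ("A-S", "syllable"): (-0.11, -0.13),
    ("Conklin", "monosyllable"): (-0.04, -0.11),
    ("Conklin", "syllable"): (-0.06, -0.11),
}

# Corrected coefficients as reported in the text, for comparison.
REPORTED = {
    ("A-S", "monosyllable"): (-0.38, -0.35),
    ("A-S", "syllable"): (-0.27, -0.32),
    ("Conklin", "monosyllable"): (-0.09, -0.25),
    ("Conklin", "syllable"): (-0.13, -0.24),
}


def corrected_table() -> dict:
    """Apply the correction to every raw coefficient in RAW."""
    out = {}
    for (test, material), pair in RAW.items():
        out[(test, material)] = tuple(
            correct_for_attenuation(
                r, TEST_RELIABILITY[test], LIST_RELIABILITY[material]
            )
            for r in pair
        )
    return out


if __name__ == "__main__":
    for key, pair in corrected_table().items():
        shown = ", ".join(f"{value:.3f}" for value in pair)
        print(key, "recomputed:", shown, "reported:", REPORTED[key])
```

The recomputed values agree with the reported ones to within about .01. With list reliabilities as low as 0.21 the divisor is only about 0.4, so the correction more than doubles each raw coefficient, which is why the text treats the corrected figures with caution.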


EXPERIMENT II 


Method.—Under the same conditions as those described in Experi- 
ment I the same poem was learned and recalled both immediately and 
after forty-eight hours by a group of sixty-eight college men. Like- 
wise, by a different group of forty-three men, the same word lists and 
nonsense syllable lists were learned and recalled immediately and 
again after forty-eight hours. Each learning and each recall occurred 
on a different day. Half of the group, in this case, learned the words 
first and the other half learned the syllables first. Because of the 
counterbalanced order in which the lists were learned, the individual 
scores may not be correlated directly with personality test scores, and 
for this purpose the four lists have been combined. This combination 
assumes that transfer and interference factors are proportional for 
each subject. Deviations from fulfilment of this assumption will 
lower the correlation, but could scarcely cause a relationship, if it 
exists in any important amount, to vanish. 

Results.—Scores are available for these subjects on the Colgate 
Personal Inventory (B2 and C2), the Allport A-S test, the Pressey 
X-O Tests for Investigating the Emotions (Total Affectivity), and the 
American Council of Education Intelligence Tests (1927 edition). The 
four tests of personality traits have correlations with intelligence which 
are practically zero and intelligence may be assumed to be of minor 
significance in the production of relationships between personality 
test scores and learning without employing partial correlations. 

The correlations between the B2 and C2 scores and the learning 
data are almost all exactly zero, and those which are not zero are 
neither consistently positive nor consistently negative. The same is 
true for the correlations between total affectivity and learning. It is 
scarcely necessary to present these coefficients. The A-S test yields 
slightly more regular results which are given in Table II. 


TABLE II.—COEFFICIENTS OF CORRELATION BETWEEN A-S TEST SCORES AND 
LEARNING RECORDS 

Learning record                                        A-S 
Poetry: immediate recall                             -0.016 
Poetry: delayed recall (percentage)                  -0.199 
Words and syllables: immediate recall                -0.027 
Words and syllables: delayed recall (percentage)     -0.224 

The same slight negative relationship which was found in Experiment I 
between A-S scores and learning appears again in these coefficients. 
It is present, also, to a somewhat greater degree, in the delayed 
recall of the poetry and of the words and syllables. The necessary 
reliability coefficients for correction of these coefficients for attenua- 
tion are not available. 


SUMMARY 


In a preliminary experiment to test the hypothesis that personality 
traits are among the factors associated with learning, coefficients of 
correlation have been computed between several measures of person- 
ality traits and immediate and delayed recalls of poetry, words and 
syllables. The Colgate B2 and C2 Tests and the Pressey X-O Tests 
(Total Affectivity) yield no significant or consistent relationships. 
The Allport A-S Test and the Conklin I-E Test give consistent but 
very low negative correlations, which mean that the higher learning 
and recall scores tend slightly to be associated with submissiveness and 
with introversion. That the relation between introversion and learn- 
ing cannot be generalized, however, is shown by the tendency to a 
positive relationship between the Neyman-Kohlstedt scores and learn- 
ing. The virtually zero relationships between the personality scores 
and intelligence show that whatever relationships exist between the 
former and learning cannot be ascribed to the possibility that the per- 
sonality tests might be to some extent measures of intelligence. 

In the light of the measures employed, we may permit the rela- 
tively consistent tendency for submissiveness and introversion (Conk- 
lin) to be associated with better learning to have some little weight. 
The reliability of the learning records is low and only brief samples of 
learning performance have been taken. The measures of personality 
employed, while having a higher reliability than the learning materials, 
are none too reliable. The fact that consistently negative correlations 
appear under these conditions may be taken to mean that the traits 
measured are among the factors correlated with learning. It is prob- 
able that many others are, also; and it is, we believe, still more probable 
that integrations of traits and larger samplings of learning performance 
would show much higher correlations. 








NEW-TYPE OR OBJECTIVE TESTS: A SUMMARY OF 
RECENT INVESTIGATIONS 
The popularity of the new-type or objective test has resulted in a 
sizeable amount of literature, part of which is mere discussion and part 
investigation. The researches in this field up to 1929 were summarized 
by Ruch (52)* in his The Objective or New-type Examination. Since 
the publication of this very adequate treatment of the field there have 
appeared a rather large number of studies on various phases of the 
problem. It is the purpose of this article to summarize these researches 
very briefly and relate their findings with those previously summarized 
by Ruch. Most of the studies have appeared since Ruch’s work but 
some previous studies have been included for emphasis. 


TEACHING VALUES OF NEW-TYPE TESTS 


Giving Tests.—Tests have long been recognized as measuring instru- 
ments but their actual value as instructional material has been very 
meagerly considered by most teachers. The work of Schulte (54), Maloney 
and Ruch (39), Landis (35), and Turney (63) indicates that the taking 
of examinations increases the students’ knowledge of the subject. 
Schulte (54) experimenting with college classes, one group which 
expected a final examination and one group which did not, found that 
the knowledge that there was to be a final examination produced 
worthwhile results. Maloney and Ruch (39) in experimenting with 
methods of teaching grammar found that the objective test assign- 
ment was superior to a combination of objective test and textbook and 
the textbook method was a poor third. Landis (33, p. 44) when giving 
an objective test at the beginning of the period followed by an essay 
test covering the same material, found that the answers in the essay 
test were very greatly influenced by the objective test. Turney (63) 
found that students taking twelve short objective tests in psychology 
did significantly better work than similar students not taking such tests. 

Scoring Tests.—Methods of correcting papers and enabling students 
to see their mistakes seem to affect the learning of the students. Cocks 
(13, pp. 45-49) found that pupils correcting their own true-false tests 
did much better on a recall test given a week later than did the pupils 





* The number indicates the reference at the end of the article. 
whose papers were corrected by the teacher, then returned to them. 
Curtis and Woods (18) made a much more comprehensive study includ- 
ing various types of objective tests and four methods of correcting 
papers. They conclude, 


“When new-type examinations are used as a teaching device, the method of 
correction which requires the least of the teacher’s time, namely that under which 
the pupils check the incorrect items on their own papers during a discussion of the 
answers is most profitable” (18, p. 623). 


Summary of Teaching Values.—The following techniques seem to 
increase the instructional value of objective tests: 
1. Informing the pupils that a final examination is to be given. 
2. Using objective tests as assignments. 
3. Giving a number of short tests throughout the course. 
4. Having the pupils correct their own papers. 


COMPARATIVE VALIDITIES 


Various Types.—The measurement of the validity of each type 
of objective test presents one of the most important problems for 
research. The difficulty in determining validity lies in the selection 
of an adequate criterion of success in the subject. Various studies 
have used different criteria and their findings must be considered in 
relation to the adequacy of these criteria. 

Corey (15) found that essay and new-type tests measure very nearly 
the same function. Eurich (22) in a comprehensive study indicated 
that essay, multiple-choice, completion and true-false tests have 
approximately equal validity. Peters and Marty (49), using school 
marks as a criterion, found that multiple choice and completion tests 
were slightly more valid than the essay examination. The essay test 
was, however, somewhat more valid than true-false questions. They 
also found that the latter were less valid in the elementary school 
than in the secondary grades. Watson and Crawford (66) using the 
differences between the extremes of the group as a criterion found* 
that in physics the completion test was most valid, followed closely by 
the essay test. The best answer type was ranked third, but con- 
siderably lower, while the true-false test was least valid. Their find- 
ings are based on an equal and rather small number of questions and 





* They presented only their conclusions without presenting their data. 
as the authors suggest, should be repeated using a larger number of 
items in the tests. Shulson and Crawford (55) in a rather poorly 
controlled experiment, indicated that the true-false and completion 
type were equally valid for college classes. 

Bayless and Bedell (3) using tests in general science designed to 
measure the use of ideas instead of mere recall and reproduction, found 
that controlled completion in story form was most valid and true-false 
least valid. 

There have been several modifications of the true-false test which 
have been studied. One modification, that of having the pupils correct 
the wrong statements, was found by Bayless and Bedell (3) to be 
more valid than the ordinary true-false. Barton (2) suggests the 
crossing out of the part of the statement which is in error but gives 
no experimental evidence as to the validity of the method. More 
study needs to be made of these and other modifications of the true- 
false test. The use of such changes increases the diagnostic value of 
the test but at the same time increases the time used in giving and 
scoring the test. 

Home-made vs. Standardized Tests.—Are teacher-made objective 
tests as valid as standardized tests? Studies by Broom, Douglas 
and Rudd (8) and Wright (73) offer some indication as to the answer 
to this question. Broom et al. (8) found that “carefully constructed 
home-made reading tests possess as good validity as standardized 
tests in the same subject matter field.” Wright (73) constructed a 
comprehensive achievement test based on the Indiana state course of 
study for grades seven and eight. He gave both the Indiana Com- 
posite and the Stanford Achievement to a group of pupils and corre- 
lated the results with teachers’ marks by subjects. He found that 
the reading and arithmetic tests of each battery correlated about the 
same with teachers’ marks but the correlations of the history and 
language tests of the Stanford battery were much lower than the 
correlations of the same tests in the Indiana battery. This would seem 
to indicate that where the tests are not on tool subjects, a test based on 
the subject matter taught is more valid than a standardized test not 
directly based on material covered. 

Difficulty of Items.—There are two methods of approach to the 
effect the difficulty of the individual items has on the validity of the 
test. These two approaches are mathematical and experimental. 
Symonds (59) has shown mathematically that to obtain the highest 
validity one should 
“Select these items as nearly as possible of the same difficulty, which can be 
answered by the average pupil with fifty percent accuracy, as estimated on the 
basis of the preliminary estimate of the advancement of the group.” 


T. G. Thurstone (6), in a very carefully worked out experiment in 
spelling, showed that words of thirty to sixty per cent difficulty 
correlated highest with the criterion and that those of forty-five to 
fifty-five per cent difficulty were the most valid. She concludes that 


“In the preparation of examination material, teachers should be more careful 
to include harder questions than they are accustomed to and not try to make their 
tests easy enough for a large percentage of their students to make grades between 
eighty to one hundred per cent.” 
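
One familiar way to see why items of about fifty per cent difficulty are favored is that an item scored right or wrong, and passed by a proportion p of the group, has variance p(1 - p), which is greatest at p = 0.5. Such an item separates the largest number of pairs of pupils. The sketch below illustrates only this point. It is not offered as Symonds's derivation or Thurstone's procedure, and the names are hypothetical.

```python
def item_variance(p: float) -> float:
    """Variance of a right/wrong item passed by a proportion p of the group."""
    if not 0.0 <= p <= 1.0:
        raise ValueError("p must lie between 0 and 1")
    return p * (1.0 - p)


# Proportions passing from 0.00 to 1.00 in steps of 0.05.
difficulties = [i / 20 for i in range(21)]

# The difficulty level at which a single item spreads the pupils most.
best = max(difficulties, key=item_variance)

if __name__ == "__main__":
    for p in difficulties:
        print(f"p = {p:.2f}   variance = {item_variance(p):.4f}")
    print("greatest variance at p =", best)
```

An item passed by ninety per cent of the group has little more than a third of the variance of one passed by half the group, which is in keeping with the advice against making tests easy enough for most students to score between eighty and one hundred per cent.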


Lewerenz (37) has shown the need of trying out a test before using it 
in educational experiments. He found that test items of supposedly 
equal value as judged by subject-matter and test construction experts 
varied tremendously and that sections of the test which were sup- 
posedly equivalent were very unlike in some cases. 

Effect of Intelligence.—The influence of intelligence on objective 
and essay tests has been studied by Corey (15), Eurich (22), Hesnard 
(28), and Johnson (31). Their findings were in agreement and indi- 
cated that the objective tests correlate higher with intelligence than 
does the essay type examination. 

Sampling.—It is assumed that the test which samples more of the 
knowledge of a given unit of subject matter is the more valid test. The 
relative values of the new and old type tests in obtaining the most 
complete sampling have been studied in a very careful manner by 
Talbott and Ruch (60). They found that on the average the essay 
question calls forth less than one-half of the pupil’s knowledge as 
measured by an objective test. The range of knowledge sampled 
varied from twenty-five per cent to seventy-seven per cent. Their 
final conclusion considering the per cent sampled and the length of 
time involved was that “The objective test is four to five times as 
good a device for sampling as is the essay.” 

Summary.—These studies on validity seem to indicate that: 

1. Objective tests, with the exception of the true-false type, seem 
to be slightly more valid than the essay examination. 

2. The completion test or modifications of it seem to be somewhat 
superior as far as validity is concerned to other types of objective tests. 

3. The true-false test appears to be the least valid objective type, 
but modified forms of it increase its validity. 
4. Where teacher-made objective tests and standardized tests 
are measuring the same function, they are equally valid; however, 
when a test does not measure the material covered, its validity 
decreases. This is a most tentative conclusion and there is a need of 
more research on this problem. 

5. The individual items in a test should be of such difficulty as to 
be passed by about fifty per cent of the group, in order for the test to 
have its greatest validity. 

6. Before tests are used in educational experiments, they should 
be given to a similar group of pupils in order to study the validity of 
the items. Judgment of competent persons cannot be relied on to 
supply accurate estimates where the work is at all important. 

7. Objective tests correlate higher with intelligence tests than do 
essay examinations. 

8. The objective test is four to five times as good a sampling device 
as is the essay test. 

In general these conclusions do not overlap with Ruch’s (52, p. 290) 
except in so far as the earlier studies show that the new-type are at 
least as valid as the essay tests and that when the correlations between 
the essay and objective tests are corrected for errors of measurement, 
they measure approximately the same abilities. 


COMPARATIVE RELIABILITIES 


Essay vs. Objective.—Recent studies on the comparative reliability 
of old and new type tests by Corey (15), Cheydleur (11) and Caldwell 
(9) are in agreement with a study summarized by Ruch (52, pp. 291-306) 
that the objective tests have much higher reliability than do the 
essay type. A most comprehensive study in the field of foreign 
language by Cheydleur (11), however, seems to indicate that there is a much closer 
agreement in reliability between an essay and objective type than is 
commonly thought to be the case. He found that on thirty-six 
final essay type examinations involving five thousand two hundred 
seventy-one students, the average reliability coefficient was .87. The 
average coefficient on the new-type or objective tests, he found to 
be around .94. This seems to indicate that essay type tests can be 
made sufficiently reliable if special methods are used in evaluating 
them. 

Various Types.—Jones (32) found that the control completion 
test and the true-false examination were equally reliable for the same 
unit of time. Caldwell (9) found that the completion test uniformly 
surpasses all other types and that the multiple choice questions seem 
to be rather erratic. Barton (2) found that the modified true-false 
test was more reliable than the usual method of giving true-false tests. 

Repeated Grading.—In studying the reliability of essay tests, there 
are two questions to be considered. First, what is the reliability of the 
test as found, say, by correlating “odds” with “evens”? Second, what is 
the reliability of the test when it is re-graded by the same teachers? 
The first method of determining reliability has indicated that in the 
essay method of questioning it is on the whole lower than in the 
objective test. The second method of determining reliability was studied 
by Eells (21). He found that the correlation between teachers’ first 
judgments and their second judgments, taken at intervals of eleven 
weeks, was about .40, and he concludes, “the fallibility of human 
judgment, even when it is the same human judging the same material, is 
strikingly demonstrated.” 
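
The first of these methods, correlating “odds” with “evens,” can be sketched as follows. Each pupil's score on the odd-numbered items is correlated with his score on the even-numbered items. The Spearman-Brown formula, 2r/(1 + r), which is the customary step from the half-test correlation to an estimate for the whole test, is added here as standard practice and is not mentioned in the studies summarized. The item data are invented solely for illustration, and all names are hypothetical.

```python
import math
from typing import List, Sequence, Tuple


def pearson_r(x: Sequence[float], y: Sequence[float]) -> float:
    """Product-moment correlation between two equally long score lists."""
    if len(x) != len(y) or len(x) < 2:
        raise ValueError("need two score lists of equal length (at least 2)")
    mean_x = sum(x) / len(x)
    mean_y = sum(y) / len(y)
    sxy = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    sxx = sum((a - mean_x) ** 2 for a in x)
    syy = sum((b - mean_y) ** 2 for b in y)
    if sxx == 0 or syy == 0:
        raise ValueError("a half-test with no variance cannot be correlated")
    return sxy / math.sqrt(sxx * syy)


def split_half_reliability(items: List[List[int]]) -> Tuple[float, float]:
    """Return (odd-even correlation, Spearman-Brown estimate for the full test).

    `items` holds one row per pupil, with 1 for a right and 0 for a wrong item.
    """
    odd_scores = [sum(row[0::2]) for row in items]   # items 1, 3, 5, ...
    even_scores = [sum(row[1::2]) for row in items]  # items 2, 4, 6, ...
    r_half = pearson_r(odd_scores, even_scores)
    r_full = 2 * r_half / (1 + r_half)
    return r_half, r_full


# Invented responses of four pupils to a six-item test (illustration only).
EXAMPLE_ITEMS = [
    [1, 1, 1, 1, 1, 1],
    [1, 1, 1, 0, 0, 0],
    [1, 0, 0, 0, 0, 0],
    [0, 0, 0, 0, 0, 0],
]

if __name__ == "__main__":
    half, full = split_half_reliability(EXAMPLE_ITEMS)
    print(f"odd-even r = {half:.3f}, stepped-up reliability = {full:.3f}")
```

The second method, re-grading by the same teachers, needs only `pearson_r` applied to the first and second sets of marks. It measures the consistency of the grader rather than the consistency of the test.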

Difficulty of Items.—Symonds (58) in studying the reliability of 
spelling tests selected from various ranges of difficulty, found that 
“the test with the narrowest range of difficulty shows an error of 
measurement considerably less than a test with a wider range of 
difficulty.” There is also included in this article a very complete 
discussion of the various factors influencing the reliability of all types of 
tests. 

Summary.—The studies of reliability seem to indicate that: 

1. Objective type tests have a higher reliability than do essay 
examinations. 

2. When various types of objective tests are considered on the 
basis of equal working times, there is very little difference in their 
reliability. 

3. Modified true-false tests seem to have a higher reliability than 
does the usual true-false test. 

4. Repeated gradings of essay type papers by the same people 
yield correlations of around .40. 

5. Tests all of whose items are of practically the same difficulty tend 
to yield the highest reliability coefficients. 


SCORING METHODS 


The Continuity Test.—The continuity test consists of items which 
the pupil is asked to arrange according to some order, usually 
chronological. Recently there has been much discussion concerning 
methods of scoring such examinations. This discussion has appeared in the 
School Review in the following order by Wilson (70), Nesmith (43), 
Wilson (71), Worcester (72), Cureton and Dunlap (16), John (30), 
and Odell (46), and elsewhere by Michell (42, p. 75). The articles 
include a discussion and suggestions for methods of scoring. Odell 
(46) studied all the suggested methods and concludes that the method 
suggested by Wilson (70), the one used by Sangren and Woody in 
the Sangren-Woody Reading Test, and the method suggested by Odell in 
his Traditional and New-type Examination (pp. 406-408) are all 
sufficiently valid to be used, from the standpoint of both time and accuracy. 
The method suggested by Cureton and Dunlap (16) of finding ρ is the 
most defensible, but somewhat more time-consuming than the others. 
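
Assuming that the coefficient meant in the last of these methods is Spearman's rank-difference coefficient, ρ = 1 - 6Σd²/[n(n² - 1)], a continuity test may be scored by comparing the position the pupil gives each item with its correct position. The sketch below follows that assumption. It is not taken from any of the articles cited, and the function and item names are hypothetical.

```python
from typing import Hashable, Sequence


def spearman_rho(pupil_order: Sequence[Hashable],
                 correct_order: Sequence[Hashable]) -> float:
    """Rank-difference correlation between a pupil's arrangement and the key.

    Both sequences must contain the same distinct items, each exactly once.
    """
    n = len(correct_order)
    if n < 2:
        raise ValueError("need at least two items to arrange")
    if len(set(correct_order)) != n or len(pupil_order) != n:
        raise ValueError("items must be distinct and equal in number")
    if set(pupil_order) != set(correct_order):
        raise ValueError("pupil's arrangement must use exactly the key's items")

    # Correct rank of each item (1 = earliest).
    true_rank = {item: rank for rank, item in enumerate(correct_order, start=1)}
    # Sum of squared differences between the pupil's rank and the true rank.
    sum_d_squared = sum(
        (rank - true_rank[item]) ** 2
        for rank, item in enumerate(pupil_order, start=1)
    )
    return 1 - 6 * sum_d_squared / (n * (n * n - 1))


if __name__ == "__main__":
    key = ["A", "B", "C", "D", "E"]            # hypothetical chronological order
    print(spearman_rho(key, key))               # perfect arrangement
    print(spearman_rho(["A", "C", "B", "D", "E"], key))  # one adjacent pair reversed
    print(spearman_rho(key[::-1], key))         # completely reversed order
```

The coefficient gives full credit for a perfect arrangement, penalizes an item more heavily the farther it is displaced, and becomes negative for an arrangement that is more nearly reversed than correct. The squaring and summing for every pupil account for the greater time the method consumes.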

Effect of Weighting.—Many studies have been made on the effect 
of weighting items on objective tests. Studies by Douglas and Spen- 
cer (19), Bovee, Holzinger, and Morrison (4), and Noll (44) are suffi- 
cient to present the usual findings. All of these studies showed very 
high correlations (above .95) between raw scores and scores weighted 
by means of the difficulty value of the items. Peatman (48) has shown 
that it is unnecessary to weight true-false tests by the method proposed 
by Clark (12), involving analysis of items. Corey (14) proposed that 
items be rated by the importance of the question and showed differences 
between scores weighted by different professors and the raw 
scores. He neglected to give the correlations between the various 
judges as to the importance of the items. Probably if he had corre- 
lated the various judges’ ratings of importance, he would have found 
at least as large differences between them as he found between their 
various weighted scores and the raw scores. 

Wheldon and Davis (69) propose a method of studying individual 
items on a true-false test which would hardly be practical, because it 
is so time-consuming. 

Use of Scales in Scoring Essay Examinations.—Ruch (52, pp. 106-111) 
in summarizing previous studies showed that the use of scoring 
scales for the traditional examination decreased the variability of 
scoring due to subjectivity by at least half. Odell (45) working from 
the same assumption and from the value of scales in measuring hand- 
writing, drawing, etc. worked out scales for rating pupils’ answers to 
standard essay questions. He showed that the use of such scales 
did not increase the reliability of teachers’ judgments sufficiently 
to warrant the expense of developing them. 
Correcting for Chance.—The method of scoring to be used for 
true-false examinations has been the subject of controversy since the 
true-false idea was introduced. Ruch (52, pp. 355-357) in summing up the 
previous work, suggests that the indications are that the do-not-guess 
instructions with the R-W method of scoring give the most valid results. 
The two studies by Dunlap, DeMello and Cureton (20) and Staffelbach 
(56) do not contradict these findings. Dunlap, DeMello and Cureton 
(20) show that the higher correlations obtained from the scoring of 
rights, as compared with rights minus wrongs, where all items have not 
been answered are spurious. They found that the highest reliability 
is obtained by the do-not-guess method; next came the do-not-guess 
method in which all blanks are then filled in with either R or W, 
depending on which predominates in the answers already determined; and 
the lowest was the guess method. Staffelbach (56), in using the 
regression equation to determine the weights to be given, found that the 
true-false tests should be scored R + [fraction illegible] O - W, where 
O is the number of items omitted. In 
summarizing his study, he concludes that 


“Here then appears the tendency to reward the pupil who not only knows what 
he knows, but also knows what he knows not, when such happens to be the case.” 
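
The R-W method of scoring referred to above can be sketched as follows. It is the two-choice case of the general correction for chance, R - W/(k - 1), where k is the number of choices offered. On a true-false test a pupil who guesses blindly expects as many wrongs as rights, so that his expected corrected score is zero. The sketch counts rights, wrongs and omissions and applies R - W only. It does not attempt Staffelbach's weighting of omissions, and the names and the sample answers are hypothetical.

```python
from typing import Dict, Optional, Sequence


def score_true_false(responses: Sequence[Optional[bool]],
                     key: Sequence[bool]) -> Dict[str, int]:
    """Count rights (R), wrongs (W) and omissions (O), and return R - W.

    A response of None marks an item the pupil left unanswered.
    """
    if len(responses) != len(key):
        raise ValueError("responses and key must be of equal length")
    rights = wrongs = omitted = 0
    for given, correct in zip(responses, key):
        if given is None:
            omitted += 1
        elif given == correct:
            rights += 1
        else:
            wrongs += 1
    return {"R": rights, "W": wrongs, "O": omitted, "R-W": rights - wrongs}


def corrected_for_chance(rights: int, wrongs: int, choices: int = 2) -> float:
    """General correction R - W / (k - 1); reduces to R - W when k = 2."""
    if choices < 2:
        raise ValueError("an item must offer at least two choices")
    return rights - wrongs / (choices - 1)


if __name__ == "__main__":
    key = [True, False, True, True, False, False]
    answers = [True, False, False, None, False, True]  # invented pupil record
    print(score_true_false(answers, key))
```

Under do-not-guess instructions the omitted items neither add to nor subtract from the score, so the pupil is penalized only for statements he marked and marked wrongly.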


Summary.—The studies on scoring methods seem to lead to the 
following conclusions: 

1. There are several methods equally valid for scoring continuity 
tests, but all of them are somewhat complicated. 

2. Weighting of items according to their difficulty has little effect 
on reliability or validity and is so time-consuming as to be impractical. 

3. Use of special evaluating methods, such as throwing into five 
or more piles, counting number of ideas expressed and the like, 
decreases the variability in the correction of traditional examinations. 

4. The use of scales for judging standard essay questions does 
not increase the reliability of the judgments sufficiently to be of value. 

5. The use of do-not-guess instructions and the R-W formula for 
scoring appear to be the best procedure to use with true-false tests. 

6. The R-W formula, however, appears to over-correct for guessing 
somewhat. 


SPECIAL PROBLEMS PECULIAR TO THE TRUE-FALSE TEST 


Determiners.—Specific determiners are factors that influence the 
form of statement, that is, whether the statement will be more often 
true than false or the reverse. Most of the work on determiners has 
been done by Brinkmeier (5, 6, 7), Keys (6), Ruch (7) and Weidemann 
(67, 68). Weidemann (67) first showed that statements involving the 
words always or never, or including causes or reasons were twice as 
often false as true and that indications of degree or comparison were 
twice as often true as false. Brinkmeier extended the factors and 
studied the problem much more extensively. She reports her main 
work with Ruch (7). There they list a large number of statements 
which act as specific clues to the truth or falsity of the sentences. 
They conclude with 


“A studied effort to balance approximately the number of true and false 
statements containing any one individual determiner will aid in eliminating the influence 
of the determiner.” 


Brinkmeier (5) also shows that long statements, over twenty words, 
tend to be true in about three-fourths of the cases. 

Circumstantiality of content and phrasing which are usually true 
are apt to be recognized by students and add little to the value of the 
test. This has been demonstrated by Brinkmeier and Keys (6). 
Weidemann (68) presents evidence which leads to the conclusion that 
the omission of an item by a student is not influenced by the truth or 
falsity of the statement. 

First Impressions.—Is a pupil’s score raised if he goes back over 
items and makes what changes he thinks are advisable? This prob- 
lem has been studied by Lowe and Crawford (38) and Mathews (41). 
The findings of both studies agree that there is a slight 
advantage in changing the response when a second impression warrants 
it. Lowe and Crawford (38), whose study is the more complete, say 


“Second impression is not different from first impression in nine cases out of 
ten but in the one case is enough to make a significant improvement in scores.” 


There is evidence in both of the studies which would lead one to think 
that it might be advisable for pupils to study their own tendencies 
and decide which is the better method for them. 

Oral vs. Printed.—Studies made by Stump (57), Lehman (36), 
Jensen (29) and Tiegs (62, p. 470) seem to indicate that there is little 
or no difference in either reliability, validity, or average scores when 
true-false examinations are presented either orally or in mimeographed 
form. There is evidence that the two methods of presentation are 
not measuring the same mental processes and Stump (57) shows 
that intelligence affects the oral method slightly more than it does 
the written. Much individual difference exists, so that some pupils are 
superior in the oral while others are superior in the written. However, 
the oral method promises much in the way of economy. 

Guessing.—Granich (25) proposes a method of inserting a certain 
number of questions to which the pupils could not possibly know the 
answers and warning the pupils that such questions existed. He 
offers evidence that this decreases the guessing of the pupil. It would 
appear that the previously mentioned modified types suggested by 
Bayless and Bedell (3) and Barton (2) would be of more value in 
eliminating guess work. 

Summary.—The data supplied by the studies lend themselves to 
the following interpretations: 

1. There are many specific determiners by which the truth or 
falsity of statements can be recognized. These should be known and 
used equally in true and false statements. 

2. Second impressions tend to be more nearly right than wrong. 

3. There is practically no difference in the reliability or validity of 
true-false tests when presented either orally or in mimeographed form. 

4. Several methods are proposed by which it is possible to decrease 
guessing. 

Other studies dealing with special problems have been summarized 
by Ruch (52, pp. 358-368). The most important result therein
presented is that the net suggestion effect of the true-false statement
is positive instead of negative. It is thought by the present writers 
that this positive benefit would be increased if the pupils were allowed 
to correct their own papers. It is probable that with pupil correction 
of papers negative suggestion effects would be practically eliminated. 
This is merely the opinion of the writers and is not substantiated by 
any evidence. 


STUDENTS’ ATTITUDES TOWARDS TESTING 


Previous studies presented by Ruch (52, pp. 130-137) indicate that 
tests and examinations are equally unpopular with teachers and pupils. 
In more recent studies there appear indications that it is possible to
change this attitude to a more favorable one. Scheidemann
(53) and Turney (63), after using a number of short objective tests in
psychology, found that in general students preferred the use of such
check-ups and thought that the tests helped them in their work.
Cocks (13) states that the pupils liked the tests when they were allowed
to correct their own papers.

Greene (26) found that the average amount of confidence of pupils 
has a fairly constant relationship to the average proportion of right 
answers. 

Summary.—These studies indicate a possibility of making the 
testing situation such that the pupils like instead of dislike it. How- 
ever, to bring this condition about the pupils must feel that the primary 
purpose of testing is to be of help and assistance to them, not merely a 
device that the teacher can use to find out what they do not know. 


NEW TYPES OF TESTS 


There have appeared in the literature several suggestions for new- 
type tests, some of which are modifications of the present types while 
others are new departures. 

Modifications of the true-false tests as suggested by Barton (2) 
and Bayless and Bedell (3) have been discussed in the section on com- 
parative validities. Curtis and Woods (17) suggest a modified form of 
the multiple choice tests. They suggest the regular form of the multiple
choice test with the addition of a blank line. If the correct response
does not appear as one of the choices the pupil writes it on the blank
line. They offer evidence that it is as valid and as reliable as the 
usual form. Probably its greatest value is that it stimulates more 
thinking on the part of the pupil. 

A means of measuring synthesis has been proposed by Fay (23). 
He includes several ideas in the matching test instead of one. One 
item illustrating the method taken from his article is: 


(c) John Dewey—to be matched with 


Activity of child; industrial society; cooperation; democracy; understand
child.


The difference between his suggestion and the usual form of match- 
ing test is that several associations are grouped in one item. This 
probably makes the item easier because one is not dependent upon a 
single clue. This type probably is valuable in the measurement of 
certain objectives, especially in the social studies. 

Another elaboration of the matching technique has been suggested 
by Tyler (64). He terms his suggestion “The Master List.” The
main difference between his test and the usual matching test is in
material included, not in the form of the test. His suggestion appears 
significant and he has presented a very complete analysis of possibilities 
for its use. His presentation should be read by all those who are mak- 
ing serious use of new-type tests to measure the objectives of the 
course. 

Another example of improved use of the present form of tests to 
measure other than mere facts is given by Carpenter (10). He includes 
three tests in the field of English literature measuring appreciations. 
The test on Poe should be familiar to all teachers of English who wish 
to improve their testing, as it is a most significant departure from the 
usual type of objective English Test. 

A method of response has been explained by Mason (40). The 
test consists of a series of statements in front of which is A—D—U. 
The student encircles the correct response depending upon his agree- 
ment, disagreement or uncertainty. If he disagrees with the state- 
ment, he is to explain his viewpoint. There is little difference between 
this form and the true-false-doubtful test. 

Atkinson (1) has reported on a test which allows for a student’s 
self appraisal of his work on unit tests. The ideas embodied in this 
new type are those of C. A. Buckner.* The most interesting idea is 
the proposal that the form of the test be discarded and the questions 
arranged in the order in which they occurred in the learning situation. 
The questions are then put in whatever form seems best. This means 
that a true-false statement might appear next to a matching question, 
etc. There would seem to be possibilities of confusion in this pro- 
cedure which should be studied rather carefully. He offers evidence 
on reliability which suggests that the organization of tests by learning 
is as reliable as by the form of the question. There appear to be
possibilities in this type of test, but much more research on it needs to be
reported than is at present available.

Summary.—Several new or modified types have been proposed,
most of which appear to be valuable contributions to objective testing.
There is need of more research on most of the proposed types before 
they are widely used. 


MISCELLANEOUS PROBLEMS 


Spelling.—One problem which has recently received some study
is that of the various methods of testing spelling. Gruler (27) offers evidence
to show that the correcting of a wrong response is a more effective
learning device than is the usual method of oral recall or the multiple
choice method. His spelling test would be as follows, where the
correct spelling would be written at the end of the line.





1. Can you (recommend) this man? 


Phillips (50) offers evidence as to the effectiveness of the two 
response method when compared with the oral recall. The oral recall 
appears to be more reliable. Pintner, Rinsland and Zubin (51) 
studied the following two types of spelling tests in comparison with 
the oral recall. 


1. Which of the words is wrongly spelled?

      1. the   2. when   3. wil   4. same   5. and

2. The boy played in the yard.

      1. bol   2. bal   3. boll   4. ball   5. baul





They found that both types when compared with the oral recall method 
are sufficiently valid and reliable for use. The second type was slightly 
superior to the first. 

The evidence offered in the two studies indicates that probably 
the best method for use in survey and achievement tests where it is 
desired to make conditions uniform is to use the written recall (as 
suggested by Gruler), even though it takes more time to score. 

Self-appraisal.—Atkinson (1) found that “Students of high ability
were better able to appraise their possession of knowledge than were
students of low ability.” This seems to corroborate somewhat the
evidence of Staffelbach (56) that pupils should be rewarded for what
they know they do not know as well as for what they do know.


IMPORTANT DISCUSSIONAL REFERENCES 


There are several recent articles and books which are significant 
and should be familiar to the student of objective tests. 

In the field of history Krey (34) discusses the advantages and dis- 
advantages of new-type tests in history on the college level. Kelley 
(33) stresses the need for objective tests to measure attitudes and 
principles. He presents many challenges to students who wish to work 
in this field. Gibbons (24) gives examples of attempts to measure 
other phases necessary in history than those of mere fact finding. 

The development of cooperative programs of testing has been
reported by Gibbons, Orleans and Sealy, and Tyler. Gibbons (24)
reports on the development of a city-wide testing program in the
social studies of the secondary school. She also illustrates rather
well the development of tests in connection with curriculum revision.
Orleans and Sealy (47) illustrate the development of uniform objective
testing in a rural area. Their work was done in the elementary school.
Tyler (65) outlines the development of an objective testing program
on the college level. The most important phase of Tyler’s article
is the outline for the development of new-type tests for a college
course.


The basic work in this field is Ruch’s The Objective or New-type
Examination, in which is presented a most adequate treatment of
objective testing.



THE PERSONAL EQUATION IN RATINGS:
II. A SYSTEMATIC EVALUATION


HERBERT S. CONRAD*


Institute of Child Welfare, University of California 


Although ratings have been subjected to much criticism on meth- 
odological grounds, their use in educational and applied psychology 
appears quite extensive and surprisingly persistent. This widespread 
use is defensible, on the positive side, by the refinements which 
criticism of ratings has encouraged; and on the negative side, by proof 
that some of the objections to ratings are probably of little quantita- 
tive significance. The present paper is concerned with the evaluation 
of that frequently mentioned defect of ratings, the “personal equation.”
For purposes of clarity and quantitative treatment, the
personal-equation is defined as referring to spurious differences between
the means or the standard deviations of two judges’ ratings; the
personal-equation effect is defined as the influence of spurious differences
in means and standard deviations, upon the correlation of ratings with
an adequate criterion.

Let us suppose that a given sample of children has been independ- 
ently rated by two judges for any given trait. The efficiency of the 
two judges, A and B, is determined by the correlation of their ratings 
with a true criterion, C. For simplicity of presentation, let us assume 
that the correlations between the criterion and each judge’s ratings 
are the same; that is, r_AC = r_BC. If no personal equation existed
between judge A and judge B, the means of their ratings would be 
equal, and also the dispersions. Under such circumstances, the 
inclusion of A’s ratings in the same contingency table (or correlation- 
plot) as B’s, would not alter the correlation between ratings and true 





*I am indebted to Miss L. Hutson, Statistical Assistant at the Institute of 
Child Welfare, for the computation and checking of Tables I-III of this paper. 
Professors H. E. Jones and L. L. Thurstone have kindly read the manuscript. 

† Although much used, the “personal equation” has seldom been defined except
by implication. Sometimes it would appear that all deficiencies of ratings are
loosely included within the term. The present definition is believed to be
sufficiently orthodox, tracing from original astronomical usage of the term; it accords
with the definition implied by the army psychologists in their careful preliminary
work on the Army Alpha intelligence test (3, pp. 438-439). For a fuller discussion
of the term “personal equation,” see reference 2.

criterion: this correlation, for the combined contingency table, would
equal the correlation in the separate, individual tables (namely,
r_AC or r_BC). This suggests that the importance of the personal
equation in any particular case may fairly be measured by the change 
in correlation with a true criterion, caused by the combining of corre- 
lation-plots.* That, after all, is what the user of ratings by different 
judges wants to know: to what extent would disregard of the personal 
equation affect validity, as measured by correlation with an adequate 
criterion? 

Case 1.—The Judges Differ in the Means of Their Ratings.— 
Assuming that the standard deviations of the judges’ ratings are equal, 
we ask the effect of a difference between the means of their ratings. 
Suppose that the mean of judge B’s ratings differs from that of judge 
A’s by .4 of a standard deviation. Then, if the correlation between 
judge A’s ratings and the true criterion is .35, and also the correlation 
between judge B’s ratings and the criterion is .35; the difference of 
.4 SD between the means of their ratings reduces the correlation 
in a combined correlation table to .343 (see Table I, first entry). 
Reading from Table I, we see that a mean difference of 1.00 SD 
would reduce an initial correlation of .35 to .313; it would reduce an 
initial correlation of .80 to .716. Evidently, the effect of the personal 
equation increases, as the validity of each individual judge’s ratings 
increases (see Fig. 1). When the initial correlations with a criterion 
are zero, no significant personal-equation effect could exist: judge A’s 
ratings would be worthless to begin with; judge B’s ratings would 
be equally worthless; and no further lowering of the validity of the 
ratings (as measured by correlation with a criterion), could enter 
through the personal equation. 
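
The entries of Table I can be checked from the footnote formula. What follows is a minimal sketch, assuming (as in the text) equal SDs for the two judges, equal numbers of ratings from each, and r_AC = r_BC. Pooling ratings whose means lie d SD units apart raises the variance of the combined ratings from 1 to 1 + d²/4 and leaves the covariance with the criterion unchanged, so the combined correlation is r / sqrt(1 + d²/4). The function name is ours, chosen for illustration only.

```python
import math


def combined_r_case1(r, mean_diff):
    """Correlation with the criterion after pooling two judges' ratings.

    r         -- validity of each judge taken singly (r_AC = r_BC)
    mean_diff -- difference between the judges' means, in SD units
    Assumes equal SDs and equal numbers of ratings for the two judges.
    """
    # Pooled variance = within-judge variance (1) + (half the mean gap) squared.
    pooled_sd = math.sqrt(1.0 + (mean_diff / 2.0) ** 2)
    return r / pooled_sd


for r in (0.35, 0.80):
    row = [round(combined_r_case1(r, d), 3) for d in (0.4, 0.6, 0.8, 1.0)]
    print(r, row)
```

For r = .35 the sketch gives .343, .335, .325 and .313, and for r = .80 it gives .784, .766, .743 and .716, in agreement with the first and last rows of Table I.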





* The correlation in the combined correlation plot, $r_{xy}$, is given generally by
the formula (reference 3, p. 439):

$$N\sigma_x\sigma_y r_{xy} = N_1\sigma_{x_1}\sigma_{y_1}r_{x_1y_1} + N_2\sigma_{x_2}\sigma_{y_2}r_{x_2y_2} + \frac{N_1N_2}{N}(\bar{x}_1 - \bar{x}_2)(\bar{y}_1 - \bar{y}_2)$$

In this formula, $x$ may be considered to refer to ratings, $y$ to the criterion. For
our special case, involving a single sample, the mean criterion scores are identical
($\bar{y}_1 = \bar{y}_2$); also, $\sigma_{y_1}$ is the same as $\sigma_{y_2}$, and $N_1$ (the number of cases) is the same as
$N_2$. $N$ of course equals $N_1 + N_2$. By assumption, $r_{x_1y_1} = r_{x_2y_2}$. Using the
notation and assumptions of the present paper, the formula may be rewritten:

$$2\sigma_{A+B}\sigma_C r = \sigma_A\sigma_C r_{AC} + \sigma_B\sigma_C r_{BC}$$

Factoring out $\sigma_C$, we have:

$$2\sigma_{A+B} r = \sigma_A r_{AC} + \sigma_B r_{BC}.$$

We have not computed the personal-equation effect for correlations
initially lower than .35: for it is obvious that the effect would be very
small; and it is questionable whether ratings with a validity of less
than .35 may legitimately or profitably be employed. We have not
computed the personal-equation effect for correlations initially higher
than .80, because ratings rarely agree so highly with any criterion.

Fig. 1.—Illustrating the decline of correlation in the combined contingency table, as the
discrepancy between the means of two judges’ ratings increases.

Fig. 2.—Illustrating the decline of correlation in the combined contingency table, as the
discrepancy between the SD’s of two judges’ ratings increases.


TABLE I.—THE PERSONAL-EQUATION EFFECT BETWEEN TWO JUDGES, IN TERMS
OF CORRELATION. CASE 1

                Difference between the mean of judge B’s ratings and
r_AC equals     the mean of judge A’s ratings (in terms of SD) equals
r_BC equals
                 .400      .600      .800     1.000
    .35          .343      .335      .325      .313
    .40          .392      .383      .371      .358
    .45          .441      .431      .418      .403
    .50          .490      .479      .464      .447
    .55          .539      .527      .511      .492
    .60          .588      .575      .557      .537
    .65          .637      .623      .604      .581
    .70          .686      .670      .650      .626
    .75          .735      .718      .696      .671
    .80          .784      .766      .743      .716

















The figures in Table I state what would happen if certain discrepancies
in means existed. In practice, it seems probable that the
effects pictured in the upper right- and lower left-hand portions of the 
table would occur most frequently; inasmuch as the larger spurious 
differences in means appear more likely to occur under conditions 
causing the correlations with a criterion to be low, and vice versa. 
It will be observed that the coefficients in the upper right- and lower 
left-hand portions do not differ much from the initial r’s in the single 
correlation plot, as given by the Y-axis of Table I. 

Case 2.—The Judges Differ in the Standard Deviations of Their 
Ratings.—Assuming that the means of the judges’ ratings are equal, 
we ask the effect of a difference between the standard deviations. 
Suppose that the SD of judge A’s ratings is 1.00, and the SD of judge 
B’s is .80. The effect of this discrepancy in standard deviations is 
to reduce a correlation of .35 in the individual contingency tables, to 
a correlation of .348 in the combined contingency table* (see first 





* The same result is obtained if the SD of A’s ratings is 1.00, and that of B’s 
ratings is 1.25 (instead of .80). It is an interesting fact that when (as in the 
present case) the judges do not differ in the means of their ratings, then for any 
given r in the individual contingency tables, the coefficient in the combined con- 
tingency table is a function of the ratio of the smaller SD to the larger (or the ratio
entry in Table II). Reading in Table II, we may observe again (as
in Table I), that the effect of the personal equation increases, as the 
validity of each individual judge’s ratings increases. This is more 
clearly portrayed in Fig. 2. 
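
The footnote's remark that only the ratio of the two SDs matters can be worked out from the footnote formula. The following is a sketch under the assumptions of Case 2 (equal means, equal numbers of ratings from each judge, $r_{AC} = r_{BC} = r_0$); the symbol $k = \sigma_B/\sigma_A$ is introduced here for convenience and is not the author's notation.

```latex
% Equal means: the pooled variance is the average of the two variances.
\sigma_{A+B}^2 = \tfrac{1}{2}\left(\sigma_A^2 + \sigma_B^2\right)

% Substituting in 2\sigma_{A+B} r = \sigma_A r_{AC} + \sigma_B r_{BC}:
r = r_0\,\frac{\sigma_A + \sigma_B}{\sqrt{2\left(\sigma_A^2 + \sigma_B^2\right)}}
  = r_0\,\frac{1 + k}{\sqrt{2\left(1 + k^2\right)}},
  \qquad k = \frac{\sigma_B}{\sigma_A}

% Replacing k by 1/k leaves the expression unchanged.
% Worked case: k = .80, r_0 = .35
r = .35 \times \frac{1.80}{\sqrt{3.28}} \approx .348
```

The worked case agrees with the first entry of Table II, and the symmetry in $k$ and $1/k$ is why SD ratios of .80 and 1.25 give the same result.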


TABLE II.—THE PERSONAL-EQUATION EFFECT BETWEEN TWO JUDGES, IN TERMS
OF CORRELATION. CASE 2

r_AC equals     Ratio of judge B’s SD to judge A’s equals
r_BC equals
                  .80       .65       .50
    .35          .348      .343      .332
    .40          .398      .391      .379
    .45          .447      .440      .427
    .50          .497      .489      .474
    .55          .547      .538      .522
    .60          .596      .587      .569
    .65          .646      .636      .617
    .70          .696      .685      .664
    .75          .745      .734      .711
    .80          .795      .782      .759








Case 3.—The Judges Differ Both in the Means and Standard 
Deviations of Their Ratings—We assume here, as before, that each 
single judge’s ratings correlate the same with the true criterion 
(r_AC = r_BC). We shall use the statistical constants of the ratings of
one judge as reference-values for the ratings of the other: Judge A’s 
mean is taken as 0 SD, judge B’s mean is expressed as so many SD’s 
above that of judge A (the SD of A’s ratings being the unit); simi- 
larly, judge A’s SD is taken as 1, judge B’s SD is expressed as a ratio 
of judge A’s. 

Accepting these conventions, we read in Table III that when
M_B differs from M_A by .4 SD, and when SD_B = .8 SD_A, an initial
correlation of .35 for each individual judge is reduced to .340 in the
combined contingency table. An initial correlation of .40, under the 
same conditions, would be reduced to .388; an initial correlation of .80 
would be reduced to .776. The values for other combinations of 





of the larger SD to the smaller). This may readily be proved from the formula
given in the third footnote of this article. In the present case, .8:1.0 = 1.00:1.25;
hence the results for SD_B = 1.25 and SD_B = .80 are identical.

discrepancies between A’s and B’s means and standard deviations
may similarly be read from Table III.* 
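
The Case 3 figures quoted above can also be checked directly on data, by building two sets of ratings and pooling them as the combined correlation plot does. The sketch below is illustrative only: the data are artificial (random numbers forced to the stated correlation), not the author's, and all function names are ours. Each judge's ratings are constructed to correlate exactly r with the criterion; judge B's ratings are then shifted and rescaled.

```python
import math
import random


def mean(v):
    return sum(v) / len(v)


def standardize(v):
    """Rescale to mean 0 and (population) SD 1."""
    m = mean(v)
    sd = math.sqrt(sum((a - m) ** 2 for a in v) / len(v))
    return [(a - m) / sd for a in v]


def pearson(x, y):
    mx, my = mean(x), mean(y)
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    return sxy / math.sqrt(sxx * syy)


def make_ratings(c, r, rng):
    """Ratings with mean 0, SD 1 and sample correlation exactly r with c.

    c must already be standardized.
    """
    noise = standardize([rng.gauss(0, 1) for _ in c])
    slope = mean([n * ci for n, ci in zip(noise, c)])
    # Remove the part of the noise that is correlated with the criterion.
    e = standardize([n - slope * ci for n, ci in zip(noise, c)])
    return [r * ci + math.sqrt(1 - r * r) * ei for ci, ei in zip(c, e)]


def pooled_r(r, mean_diff, sd_ratio, n=500, seed=0):
    """Correlation in the combined plot for two judges rating the same n cases."""
    rng = random.Random(seed)
    c = standardize([rng.gauss(0, 1) for _ in range(n)])  # criterion C
    a = make_ratings(c, r, rng)                           # judge A: mean 0, SD 1
    b = [mean_diff + sd_ratio * x for x in make_ratings(c, r, rng)]  # judge B
    # Every case enters twice, once with each judge's rating.
    return pearson(a + b, c + c)


for r in (0.35, 0.40, 0.80):
    print(r, round(pooled_r(r, mean_diff=0.4, sd_ratio=0.8), 3))
```

With a mean difference of .4 SD and an SD ratio of .8, the pooled correlations come out at about .340, .388 and .776 for initial correlations of .35, .40 and .80, matching the values read from Table III.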

It is difficult to state certainly whether the lower or the higher 
personal-equation effects in Tables I-III are the more common in 
practice. A previous study' has indicated that the influence of the 
personal equation in certain Army ratings is practically nil. Another 
study has shown that even under circumstances calculated to inflate 
the personal-equation effect to a maximum, the effect still remains 
surprisingly small and practically negligible. If these two studies 
may be trusted, it would appear that the personal-equation effect has 
in the past been greatly over-estimated. There certainly can remain 
little doubt that improvement in the form and content of rating scales, 
in the supervision of ratings, and in the training of judges, can keep 
personal-equation effects at a very low figure. What little effect 
may remain will doubtless be sufficiently small to be ignored or 
neglected; or, in the most refined work, the effect could be minimized 
by appropriate corrections. 

We have neglected to consider differences in the skewness of each
judge’s ratings. The influence of this factor cannot be evaluated by
our correlation criterion, since the correlation coefficient depends 
(apart from linearity of regression) entirely upon means and standard 
deviations. It seems more than likely, however, that if differences 
between judges’ means and standard deviations do not usually exert 
a significant influence, then even less do differences in skewness 
(which are partly reflected through differences in means and standard 
deviations). 

Examination of Tables I-III has suggested that the personal 
equation may be less of a bogey than it is usually considered. Our 
experimental investigations’? have tended to confirm this view. 
Only further experimental studies, however, can definitely settle the 
question.

* The statements of the previous footnote do not apply here, where it is assumed

make the personal-equation effect appear as large as possible, rather than as small.



SEX DIFFERENCES IN INTELLECTUAL ABILITIES


J. D. HEILMAN 
Colorado State Teachers College 


The results of almost all of the statistical investigations on sex 
differences in intellectual functions are unconvincing on account of 
the nature of the data employed and the inadequacy of their statistical 
treatment. An examination of nine articles on sex differences in 
various intellectual traits revealed many deficits in spite of the fact 
that they appeared during the past two years (1929 and 1930) in 
technical journals. 

Six of the nine investigations made no statements about the relative 
amount of training of the sexes; two, which dealt with sex differences 
in learning ability, gave initial tests to equalize the knowledge 
possessed; and one stated that all of the pupils were taking their first 
course in the subject in which the sexes were compared, but that the 
several groups had studied different textbooks. Six of the studies 
utilized grade rather than age groups, the number of grades varying 
from two to six. This procedure results in bad sampling, especially 
in the upper grades. Only two of the investigations equalized the 
average ages of the sexes. One of the nine studies took the socio- 
economic status into consideration but failed to allow for the sex 
differences found in it; two of the remainder stated that all of the 
subjects resided in small towns. 

Only four of the nine studies present a standard error of the differ- 
ence between the averages of the sexes. None of the articles discuss 
the nature of the distributions although it is known that an adequate 
standard error presupposes a normal distribution. In one of the 
articles a graphic representation of the distributions is given but the 
high degree of their skewness is not taken into account in the discussion 
of the results. In these distributions the highest scores were removed 
from the mode by five times as many units as the lowest scores. For 
each of the sexes the number of subjects in six of the studies varied 
from twenty-three to one hundred and in the remaining three from 
one hundred fifty-two to three hundred twenty-six. 

If it be assumed that to make a reliable study of sex differences in 
intellectual functions, the sex groups should be equalized on the basis 
of training, life age and socio-economic status; that for each age there 
should be at least one hundred subjects of each sex; that the distribution
should be normal; and that there should be a statement of the
standard error of the difference, then four of the nine studies are 
deficient in five of these six requirements, three in four of them, and 
two in three of them. While the investigation to be reported in this 
article is not free from faults, it is believed that it is far superior to 
the nine investigations discussed as well as to almost all others made 
to determine sex differences in intellectual traits. 

In this investigation the sexes are compared for their average 
standing in life age, the amount of training as measured by school 
attendance, social and economic factors, mental age, educational age, 
and achievement in a number of different school subjects. For a 
few of these items differences in relative as well as in absolute varia- 
bility are considered. 

The subjects of the investigation were ten-year-old children from 
the public schools of Denver, Colorado. The age range of the children 
was only nine months, extending from ten years and one month to 
ten years and ten months. All of the children attending forty-eight 
different schools were included in the sample, provided they had 
American parentage, fell within the prescribed age limits and had 
sufficiently complete records on the several items of the investigation. 
All but fifty of the children had attended only the Denver schools. 
The schools were selected from communities with an average social 
and economic status. 

The Stanford revision of the Binet scale was used to find the mental
age. All of the tests were administered by two women¹ who had
received special training for the purpose. The scoring was checked 
by a third person who had done most of the testing in the Denver 
schools. Some of the checking was done by having one of the testers 
check the other tester’s work. The testing period lasted about eighteen 
weeks, but by means of the IQ the mental ages were all corrected to a 
single date, the first day of March. 
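
The article does not state how the correction was computed. One plausible reading, offered only as an assumption, is that the ratio IQ (100 times mental age over life age) was treated as constant over the few weeks concerned, so that the mental age on the reference date is the IQ applied to the life age on that date. The sketch below illustrates that reading with a hypothetical child; the function names and figures are ours.

```python
def ratio_iq(mental_age_months, life_age_months):
    """Ratio IQ: 100 x mental age / life age (ages in months)."""
    return 100.0 * mental_age_months / life_age_months


def mental_age_at(reference_life_age_months,
                  tested_mental_age_months,
                  tested_life_age_months):
    """Mental age on the reference date, ASSUMING the IQ stays constant."""
    iq = ratio_iq(tested_mental_age_months, tested_life_age_months)
    return iq * reference_life_age_months / 100.0


# Hypothetical child (not from the study): tested at a life age of 124 months
# with a mental age of 130 months, and 126 months old on the reference date.
corrected = mental_age_at(126, 130, 124)
print(round(ratio_iq(130, 124), 1), round(corrected, 1))
```

Under this assumption a child tested two months before the reference date gains slightly more than two months of mental age if the IQ is above 100, and slightly less if it is below.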

The educational age was based on the results of the Stanford 
Achievement Test. This test was given during the two weeks preced- 
ing and the two weeks following the first of March. No correction 
was made for differences in the time of administration, because the 
probable error of the test for ten-year-olds is about 1.5 months. More- 
over, as about an equal number of boys and girls were tested at the 





¹ For a statement of acknowledgements, see chapter two of the
Twenty-seventh Yearbook of the National Society for the Study of Education.

same time, such a correction would have had no effect upon the sex
differences of our samples. Experienced students under the super- 
vision of an assistant in psychology did all of the scoring. The socio- 
economic status of the home was measured by a revision of the 
Chapman-Sims Socio-economic Scale. Life age and school attend- 
ance were obtained from carefully kept school records. 

According to the original plan one thousand children were to be 
included in the study, but it was begun somewhat too late in the school 
year to obtain all of the data for the entire number. On life age and 
the scores made in the different school subjects data are available for 
four hundred eighty-two girls and four hundred sixty-four boys. On 
mental age, socio-economic status and school attendance the data 
for four hundred thirty-one girls and three hundred ninety-seven boys 
are used in this study. For socio-economic status and school attendance,
data were obtained for the larger groups also; but they had been limited
to the smaller groups for another purpose, and were not taken off the
original records for the larger groups.

In Table I the means and medians in life age, school attendance, 
socio-economic status, mental age, and educational age of a sample of 
four hundred thirty-one girls are compared with similar measures for 
a sample of three hundred ninety-seven boys. In life age the differ- 
ences between the means and medians are negligible. For the remain- 
ing four measurements the means and the medians of the boys are 
higher than those of the girls, excepting those for school attendance. 
If it be assumed that the significance ratio, which is the quotient of 
the obtained difference divided by the standard error of the difference, 
must be 2.78 or larger before it can be said that the true difference is 
above zero with practical certainty, then there is no practical certainty 
that the true difference for any one of these comparisons is above zero. 
The highest significance ratio is 2.4. This was found for the difference 
between the medians in mental age. The next highest ratio is 1.7 and 
was found for the difference between the means for mental age. The 
boys have the higher mean and the higher median for mental age. It 
should be stated that the computation of the standard error of the 
difference between the medians was based on the quartile deviation. 
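
The significance ratio defined above can be sketched in a few lines. The article does not give its formula for the standard error of a difference between means; the sketch assumes the usual one for two independent groups, the square root of the sum of the squared standard errors of the two means. The figures are hypothetical and are not taken from Table I; only the criterion of 2.78 comes from the text.

```python
import math

PRACTICAL_CERTAINTY = 2.78  # criterion stated in the text


def se_mean(sd, n):
    """Standard error of a mean."""
    return sd / math.sqrt(n)


def significance_ratio(mean_1, sd_1, n_1, mean_2, sd_2, n_2):
    """Obtained difference divided by the standard error of the difference.

    ASSUMES independent groups: SE_diff = sqrt(SE_1^2 + SE_2^2).
    """
    se_diff = math.sqrt(se_mean(sd_1, n_1) ** 2 + se_mean(sd_2, n_2) ** 2)
    return (mean_1 - mean_2) / se_diff


# Hypothetical mental ages in months: 397 boys against 431 girls.
ratio = significance_ratio(121.5, 14.0, 397, 120.0, 13.0, 431)
practically_certain = abs(ratio) >= PRACTICAL_CERTAINTY
print(round(ratio, 2), practically_certain)
```

With these invented figures a difference of a month and a half between the means falls well short of the criterion, which shows how large a difference is needed with groups of about four hundred.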

The amount of skewness of all of the traits listed in Table I is 
relatively low. The highest degree of skewness was found for the 





1 Heilman, J. D.: A Revision of the Chapman-Sims Socio-economic Scale. 
Journal of Educational Research, September, 1928.

girls’ distribution in mental age. Moreover, the largest difference in
the amount of skewness was found for mental age. Perhaps these 
facts account in part for the closer approximation to a reliable difference 
in mental age than for the other traits. 

Differences between the means and medians for life age, scores in 
individual school subjects and the composite scores for all of the school 
subjects combined are given in Table II. The individual school subjects
include reading, arithmetic, nature study and science, history
and literature, language usage, and spelling. Results are also given
for paragraph, sentence, and word meaning in reading, and for compu- 
tation and problem-solving in arithmetic. The standard errors of the 
differences between the means and the medians for all of these measure- 
ments as well as the degree of skewness for each distribution are listed 
in the table. For these results the number of girls was increased to 
four hundred eighty-two and the number of boys to four hundred 
sixty-four. 

The differences between the means and the medians in life age are 
so small as to be negligible. In spelling the obtained differences are 
so large as to justify the statement that the true differences are certain 
to be above zero. They are in favor of the girls. There are two other
differences for which the significance ratio is above 2.78. They are 
the differences between the means for history and literature and for 
language usage. The significance ratio for the difference between 
the medians in history and literature is 2.60, which is somewhat too 
small for complete reliability. Moreover, the degrees of skewness 
for the distributions in history and literature are so high as to render 
the obtained values unreliable. This test was too difficult for our 
subjects, the scores being piled up at the lower end of the distributions. 
Nine girls and twenty-two boys made negative scores. The effect of 
this skewness was to decrease the size of the measures of variation on 
which the size of the standard errors in part depends. The standard 
errors are, therefore, smaller than they would have been for normal 
distributions, and this tends to increase the size of the significance 
ratios. The significance ratio for the difference between the medians 
in language usage is only 1.68. For nature study and science in which 
the boys are superior, the significance ratio is 2.64 for both of the
differences. In problem-solving the significance ratio for the difference
between the means is 2.67 and for the difference between the medians
it is 2.71. While these ratios are high there is still some chance that
the true difference may be zero or in favor of the other sex, in this
case the girls. The next highest significance ratio is 2.1. This was
found for the difference between the means in arithmetical computation,
in which the girls excel. Out of the twenty-four comparisons, not
including life age, the boys excel in nine and the girls in fifteen.

We may conclude by saying that the differences in spelling are 
large enough to make it practically certain that the true difference is 
above zero; that the differences in language usage, nature study and
science, and problem-solving are so large as to approach closely the
requirements for practical certainty of true differences above a zero
value; and that the remaining differences are too low to indicate any 
reliable sex differences. 

For the smaller groups of boys and girls, there are listed in Table 
III three measures of absolute variation in life age, school attendance, 
socio-economic status, mental age, and educational age. The three 
measures are the standard deviation, the quartile deviation and D. 
The standard errors of these measures are also given. In computing 
the standard errors of the quartile deviation and of D, the quartile 
deviation and the D of the distributions were used, respectively. The 
differences between these measures of absolute variation for the sexes 
are too small to give much confidence in the belief that the true differ- 
ence lies above zero. The two highest significance ratios were found 
for school attendance and educational age. They are 1.5 and 2.4 
respectively. The differences in life age are negligible. Of the remain- 
ing twelve comparisons, the boys have the larger variation in eight. 

For life age, mental age, educational age, and school attendance 
measures of relative variability were computed by dividing the stand- 
ard deviation of the distribution by its mean. The differences between 
these measures are low, the largest significance ratio being only 1.33. 
This ratio was found for mental age. 
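
The relative measure described above is the ratio of a distribution's standard deviation to its mean. A minimal sketch in Python, assuming the population form of the standard deviation (the article does not say which form was used) and wholly hypothetical scores; the function name is illustrative only:

```python
from statistics import mean, pstdev


def relative_variability(scores):
    """Standard deviation of the distribution divided by its mean."""
    return pstdev(scores) / mean(scores)


# Hypothetical mental ages in months for two small groups (not the study's data).
girls = [112, 120, 125, 131, 140, 150]
boys = [108, 118, 126, 133, 145, 158]

print(round(relative_variability(girls), 3))
print(round(relative_variability(boys), 3))
```

Because the measure is a pure ratio, it allows groups with different means to be compared on the same footing.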

For the larger groups measures of absolute variability, which are 
similar to those listed in Table III, are given in Table IV. These 
measures pertain to life age, scores in the individual school subjects, 
various phases of some of these subjects, and the composite score for 
all of the subjects. The only difference which was large enough to 
guarantee the necessary confidence that the true difference was above 
zero, was the one found for history and literature. However, on 
account of the high degree of skewness found for these distributions, 
the difference cannot be accepted as reliable. The significance ratio 
for the difference between the standard deviations of paragraph mean- 
ing is 4.74, but for the differences between the quartile deviations and 
the D’s the ratios are only 2.5 and .7 respectively. As all of the other
differences give less confidence that the true difference is above zero, it 
is best to assume that none of these differences are reliable. Not 
counting life age, the boys have the larger obtained variation in twenty- 
nine of the thirty-six differences. 


TABLE V. SEX DIFFERENCES IN THE TENTH AND NINETIETH PERCENTILE POINTS
OF A SAMPLE OF FOUR HUNDRED THIRTY-ONE TEN-YEAR-OLD GIRLS AND
A SAMPLE OF THREE HUNDRED NINETY-SEVEN TEN-YEAR-OLD BOYS IN
LIFE AGE, SCHOOL ATTENDANCE, SOCIO-ECONOMIC STATUS, MENTAL
AGE AND EDUCATIONAL AGE

                                                           SE difference   Chances in one hundred
                                                           for P10 and     true difference is
                                                           P90             above zero
                                         P10       P90                      P10       P90

Life age in months       Girls         122.02    129.09
                         Boys          121.99    129.07
                         Differences      .03       .02         .37          53        52

School attendance        Girls         549.29    796.51
                         Boys          534.32    798.19
                         Differences   -14.97      2.68       12.62          88        58

Socio-economic status    Girls         459.17    525.33
                         Boys          459.90    528.48
                         Differences     -.73     -3.15        4.15          61        78

Mental age in months     Girls         110.02    157.19
                         Boys          110.25    159.88
                         Differences      .23     -2.69        2.14          54        90

Educational age          Girls         121.93    156.75
                         Boys          121.24    159.48
                         Differences      .69     -2.73        1.61          66        96

In Table V the smaller groups are compared for differences between
the tenth and the ninetieth percentile points in life age, school attendance,
socio-economic status, mental age and educational age. None
of these differences are statistically significant. The highest significance
ratio is only 1.7. It was found for the ninetieth percentile
points in educational age, the higher percentile having been found for
the boys.


TABLE VI. SEX DIFFERENCES AT THE TENTH AND NINETIETH PERCENTILE POINTS
OF A SAMPLE OF FOUR HUNDRED EIGHTY-TWO TEN-YEAR-OLD GIRLS AND A
SAMPLE OF FOUR HUNDRED SIXTY-FOUR TEN-YEAR-OLD BOYS IN LIFE AGE,
SCORES IN INDIVIDUAL SCHOOL SUBJECTS AND THE COMPOSITE SCORE
OF ALL SUBJECTS

                                                             SE difference   Chances in one hundred
                                                             for P10 and     true difference is
                                                             P90             above zero
                                           P10       P90                      P10       P90

Life age in months         Girls         121.99    129.10
                           Boys          122.02    129.16
                           Differences     -.03      -.06         .35          54        57

Composite scores           Girls          31.87     63.47
                           Boys           30.23     64.47
                           Differences     1.64     -1.00        1.41          88        76

Reading: paragraph         Girls          43.74     82.33
meaning                    Boys           42.17     82.37
                           Differences     1.57      -.04        1.67          83        51

Reading: sentence          Girls          23.48     58.29
meaning                    Boys           22.49     57.67
                           Differences      .99       .60        1.57          74        55

Reading: word meaning      Girls          27.60     56.44
                           Boys           25.01     58.03
                           Differences     2.59     -1.59        1.25          98        90

Reading: total             Girls         101.10    190.53
                           Boys           90.78    192.75
                           Differences    10.32     -2.22        4.12          99        71

Arithmetic: computation    Girls          64.57    121.76
                           Boys           63.35    121.32
                           Differences     1.22       .44        2.07          72        58

Arithmetic: reasoning      Girls          34.64     79.67
                           Boys           37.14     83.14
                           Differences    -2.50     -3.47        2.04          89        96

Arithmetic: total          Girls         104.60    197.83
                           Boys          101.69    201.53
                           Differences     2.91     -3.70        3.96          77        82

Nature study and science   Girls          16.24     55.83
                           Boys           19.54     59.02
                           Differences    -3.30     -3.19        1.69          98        97

History and literature     Girls           7.09     39.96
                           Boys            5.46     53.46
                           Differences     1.63    -13.50        1.75          82       100

Language usage             Girls           7.47     38.83
                           Boys            4.00     34.70
                           Differences     3.47      4.13        1.54          99        99.6

Spelling                   Girls          66.73    139.83
                           Boys           54.88    129.30
                           Differences    11.85     10.53        3.04         100       100

In Table VI the larger groups are compared for differences between
the tenth and the ninetieth percentile points in the distributions for 
life age, scores in the individual school subjects and the composite 
scores for all of the school subjects. Practical certainty for a true 
difference above zero was found for the tenth and ninetieth percentile 
points in spelling; also for the ninetieth percentile points in history 
and literature, but this finding can not be accepted on account of the 
presence of skewness in the distributions. The significance ratio for 
the differences between tenth percentile points in paragraph meaning 
and language usage is 2.4, and the ratio for the difference between the 
ninetieth percentile points in language usage is 2.68. The next lower 
ratio is two and applies to two differences. All but two of the remain- 
ing ratios are considerably smaller. At the tenth percentile point and 
at the ninetieth percentile point, neither sex shows a reliable superiority 
in the functions measured. The boys have the lower tenth percentile 
scores in ten out of the twelve comparisons, and they have the higher 
ninetieth percentile points in seven out of the twelve comparisons. 
However, one can not be practically certain that a true difference 
exists for these measurements apart from the subject of spelling. 
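
The significance ratios quoted here are obtained differences divided by their standard errors. The sketch below, in Python, works one case through: the language usage entries of Table VI for the ninetieth percentile points (difference 4.13, standard error 1.54). It assumes, as the article does not state outright, that the tabled "chances in one hundred" are areas under the normal curve below the ratio; the function names are illustrative only.

```python
from math import erf, sqrt


def significance_ratio(difference, se_difference):
    """Obtained difference divided by its standard error (sign ignored)."""
    return abs(difference) / se_difference


def chances_in_100(ratio):
    """Chances in one hundred that the true difference is above zero.

    Assumption: the chances are read from the normal curve, i.e. the
    area below a deviate equal to the significance ratio.
    """
    return 100 * 0.5 * (1 + erf(ratio / sqrt(2)))


# Language usage, ninetieth percentile points (values taken from Table VI).
ratio = significance_ratio(4.13, 1.54)
print(round(ratio, 2), round(chances_in_100(ratio), 1))
```

Under that assumption the computed ratio agrees with the 2.68 given in the text, and a ratio of zero corresponds to even chances of fifty in one hundred.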

For a carefully selected sample of four hundred eighty-two girls 
and four hundred sixty-four boys, a study was made of the sex differ- 
ences in the variation and the average attainment in life age, reading, 
arithmetic, nature study and science, history and literature, language 
usage, and spelling. Differences in variation and average attainment 
were also studied for a composite score of all of the subjects and for 
certain phases of reading and of arithmetic. In the latter are included 
paragraph, sentence and word meaning in reading and computation 
and reasoning in arithmetic. 

Average attainment is expressed in terms of both the mean and 
the median, and variation in terms of the standard deviation, the 
quartile deviation and D, the difference between the tenth and nine- 
tieth percentile points. Sex differences between the tenth percentile 
points and between the ninetieth percentile points were also computed. 
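
The measures named in this paragraph can be computed as follows. This is a sketch in Python on hypothetical scores, assuming the population standard deviation and the default percentile interpolation of the standard library, neither of which is specified in the article; `describe` is an illustrative name.

```python
from statistics import mean, median, pstdev, quantiles


def describe(scores):
    """Average attainment and variation as defined in the text."""
    q1, _, q3 = quantiles(scores, n=4)    # first, second and third quartile points
    deciles = quantiles(scores, n=10)     # nine cut points: P10, P20, ..., P90
    p10, p90 = deciles[0], deciles[-1]
    return {
        "mean": mean(scores),
        "median": median(scores),
        "standard_deviation": pstdev(scores),
        "quartile_deviation": (q3 - q1) / 2,  # half the interquartile range
        "P10": p10,
        "P90": p90,
        "D": p90 - p10,                       # range from the tenth to the ninetieth percentile
    }


# Hypothetical scores 1 through 99, chosen so the percentile points are easy to check.
summary = describe(list(range(1, 100)))
for name, value in summary.items():
    print(name, value)
```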

For life age the sexes showed an insignificant difference between 
their means and medians. As far as the remaining twelve items on 
which measurements were made are concerned, we can be practically 
certain that the true difference between the means is above zero only 
for spelling and language usage, the girls excelling in both of them. 
The chances that the true difference is above zero in arithmetical 
reasoning and in nature study and science, in both of which the boys 
excel, are close to practical certainty, but there is some chance that
the true difference is zero or in favor of the girls. The statements 
made concerning the differences between the means also apply to the 
differences between the medians, excepting the case of language usage, 
in which the chances that the true difference is above zero are only 
ninety-five in one hundred. 

For the three different measures of variation a negligible amount 
of difference was found for life age. Concerning the remaining twelve 
items it can be said that the true difference is practically certain to 
be above zero only for the standard deviations in paragraph meaning 
in which the boys show the larger variation. 

A negligible amount of difference was found between the tenth 
percentile points and between the ninetieth percentile points for the 
two sexes in life age. The twenty-four differences between these 
points for the remaining twelve items are unreliable. 

Differences between the means, the medians, the standard devia- 
tions, the quartile deviations and the ranges from the tenth to the 
ninetieth percentile points in life age, school attendance, socio-eco- 
nomic status, mental age and educational age were computed for 
four hundred thirty-one girls and three hundred ninety-seven boys 
from the two larger groups. Excepting the differences between the 
median mental ages, the chances that any one of these differences is
above zero are not even very high. Differences between the tenth 
percentile points and between the ninetieth percentile points for all 
of these five measurements are unreliable. 

In conclusion it may be said that, as far as averages are concerned, 
one can be practically certain that the true sex difference in spelling 
lies above zero and that it is in favor of the girls; and that for language
usage, reasoning in arithmetic, and nature study and science, the
chances that the true differences are above zero are high but not
sufficiently so to guarantee practical certainty. For variability it is
probably safest to say that no true difference has been established.
Moreover, excepting spelling, no true sex difference has been found
for the tenth and the ninetieth percentile points. It is important to
remember that the results of this investigation are applicable only to
ten-year-olds, and that it is unsafe to apply them to children of any
other age.








A NOTE ON AN ERROR MADE IN INVESTIGATIONS
OF HOMOGENEOUS GROUPING¹


DAVID SEGEL 
Office of Education, Washington, D. C. 


There has been considerable literature lately on the advantages 
and disadvantages of the use of homogeneous grouping in schools. 
Some of these evaluations have used objective data. One of the 
objective methods and perhaps the most important method used in 
the evaluation of homogeneous grouping has been to find the relation- 
ship between the predicted criterion scores and the criterion scores 
themselves. It is an error in the method for finding this relationship 
which is here described. 

Let us explain briefly the meaning of the terms and the factors 
in this relationship. For example, an entering junior high school group 
may be divided into instructional groups in English on the basis of an 
English test or previous marks in English, or some other measure 
having some relation to ability in junior high school English. On 
the other hand this grouping may be done on the basis of a combination 
of such measures as mentioned. In this case the measures may be 
combined through the use of the regression equation. The regression 
equation contributes two factors in the prediction of scores. First, 
it gives us the optimum weighting of the various measures and second, 
it regresses the scores towards the mean. From the standpoint of
grouping, the first factor is the only one operating.
The regression of scores towards the mean will not change the
rank of a score and would therefore have no effect on grouping. This
score used in grouping, whether from a single measure or a combination
of measures, is called the predicted criterion score. If the groups of
students in English are given marks on their work by the teacher or
are tested with some English achievement test the marks or resulting 
test scores are called criterion scores. 
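
The point that regression toward the mean leaves the grouping unchanged can be checked directly: a regression equation with a positive slope is an increasing linear function, so it preserves rank order. The sketch below, in Python, uses hypothetical test scores and a hypothetical slope of .6; the function names are illustrative only.

```python
def regress_toward_mean(scores, slope, mean_score):
    """Predicted score = mean + slope * (deviation from the mean)."""
    return [mean_score + slope * (s - mean_score) for s in scores]


def ranks(values):
    """Rank of each value, 1 for the lowest (no ties assumed)."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    result = [0] * len(values)
    for rank, index in enumerate(order, start=1):
        result[index] = rank
    return result


# Hypothetical English test scores for five entering pupils.
raw = [72, 95, 58, 110, 84]
average = sum(raw) / len(raw)
predicted = regress_toward_mean(raw, slope=0.6, mean_score=average)

print(ranks(raw))
print(ranks(predicted))
```

The predicted scores lie closer to the mean than the raw scores, yet each pupil holds the same rank and would fall in the same instructional group.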

It is the relationship between this measure before grouping, i.e.,
the predicted criterion score and the measure after grouping, i.e.,
the criterion score which is used to measure the accuracy of grouping.
This relationship is measured by the correlation coefficient or more 
simply by the resulting overlapping in the criterion scores between the 





1 This critical analysis was suggested to me following certain observations 
made by Clark L. Hull in his ‘‘ Aptitude Testing,” pp. 244-245. 
groups. The overlapping and the correlation coefficient vary
together, although inversely, and any change that can be shown to
affect the one will necessarily change the other.

Hughes found a correlation of .84 between a predicted criterion 
made up of intelligence quotients and a criterion made up of educational
quotients in a study of homogeneous grouping.¹ He also uses
scatter diagrams to show the overlapping in his homogeneous groups.
Similarly Burr² uses both the correlation coefficient and overlapping
in his arguments against homogeneous grouping. Other studies of
homogeneous grouping have used either one or both of these methods of
showing the efficiency of grouping. 

The error made by investigators of homogeneous grouping is that
the correlation coefficient found between the predicted criterion and
the criterion is not the correct one because it has not been corrected
for attenuation in the criterion. Take, for instance, the correlation
between a language prognosis test and teachers' marks at the end of
the semester. This correlation was found to be .55 in one instance. 
However, the correlation between the prognosis test at the beginning 
of the semester and a modern language new type test at the end of the 
semester was found to be .84. Now let us go further—suppose these 
same students were tested with still another language test at the end 
of the semester. The correlation between the language prognosis 
test and the two language tests together would be above .84. 
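
Why a second end-of-semester test should raise the correlation can be seen from the standard formula for the correlation of one variable with the unweighted sum of two standardized variables. The Python sketch below rests on assumptions not given in the text: that the second test correlates with the prognosis test as well as the first (.84), and that the two language tests intercorrelate at a hypothetical .80. The function name is illustrative only.

```python
from math import sqrt


def correlation_with_sum(r_x1, r_x2, r_12):
    """Correlation of x with the sum of two standardized tests.

    r_x1, r_x2: correlations of x with each test;
    r_12: correlation between the two tests.
    """
    return (r_x1 + r_x2) / sqrt(2 + 2 * r_12)


# Assumed values: both tests correlate .84 with the prognosis test,
# and .80 (hypothetical) with each other.
combined = correlation_with_sum(0.84, 0.84, 0.80)
print(round(combined, 3))
```

Under these assumptions the combined correlation exceeds .84 whenever the two tests are less than perfectly correlated with each other, and falls back to .84 only when they are interchangeable.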

It will be readily admitted that interpretations of the value of 
homogeneous grouping will depend upon which of these correlations is 
used. It is to be noted that the actual grouping for any of these 
possibilities is the same. Therefore, the correlation coefficients and 
overlapping shown in the studies of Hughes and Burr and others do
not give the true picture of the actual differentiation into groups since 
only inadequate measures of the true grouping have been used. 

The correct correlation is obtained by correcting the obtained 
correlation coefficients on account of attenuation in the criterion. 
The correction for attenuation has been thought to apply only in 
theoretical discussions and no doubt because of this its application to 
investigations of homogeneous grouping has been overlooked. The 





1 Hughes, W. Hardin: How Homogeneous is a Homogeneous Group. The 
Nation’s Schools, Vol. VI, No. 4, October, 1930. 

2 Burr, Marvin V.: “A Study of Homogeneous Grouping in Terms of Individual
Variations and the Teaching Problem.” Teachers College Contribution to Educa- 
tion, No. 457, Teachers College, Columbia University, 1931. 
correction for attenuation in the criterion must be made in the cases
of homogeneous grouping because we are considering the true distribution
of ability in the different groups. The true ability of a
group is more homogeneous than can be demonstrated with any number
of measures less than an infinite number. It is believed that there is
need for reinterpretation of the results of most studies of homogeneous
grouping.

The formula for the correction for attenuation in this case is:

$$r_{0\infty} = \frac{r_{01}}{\sqrt{r_{11}}}$$

where $r_{01}$ is the correlation between the predicted achievement and
the actual achievement and $r_{11}$ is the reliability coefficient of the
criterion. Where the reliability of the criterion is not known no accurate
$r_{0\infty}$ can be calculated and only the minimum accuracy of a classification
can be determined. However, for practical purposes in many
cases some estimate of the reliability of the criterion can be made and a
good estimate of the accuracy of a classification made.
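
As a worked illustration, the usual correction for attenuation in the criterion alone divides the obtained correlation by the square root of the criterion's reliability. The Python sketch below applies it to the correlation of .55 with teachers' marks mentioned earlier, using a purely hypothetical reliability of .60 for those marks (the note gives no reliability figure); the function name is illustrative only.

```python
from math import sqrt


def correct_for_criterion_attenuation(r_obtained, criterion_reliability):
    """Obtained correlation divided by the square root of the criterion reliability."""
    if not 0 < criterion_reliability <= 1:
        raise ValueError("reliability must lie between 0 (exclusive) and 1")
    return r_obtained / sqrt(criterion_reliability)


# .55 is the obtained correlation from the text; .60 is an assumed reliability.
corrected = correct_for_criterion_attenuation(0.55, 0.60)
print(round(corrected, 2))
```

With a perfectly reliable criterion the correction leaves the coefficient unchanged, which is why the uncorrected coefficient gives only the minimum accuracy of a classification.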





A FURTHER NOTE ON THE SIGNIFICANCE OF A 
DIFFERENCE BETWEEN THE MEANS OF 
MATCHED GROUPS 


E. F. LINDQUIST 
State University of Iowa 


In an article in the September, 1932 issue of this Journal, Ezekiel¹
made the statement that Wilks’² formula for the significance of a
difference between the means of matched groups, previously discussed
in the Journal by the writer,³ represented essentially a rediscovery of a
method suggested more than twenty years ago by “Student.”⁴ In
the same article, Ezekiel stated further that Wilks’ formula is less
accurate than “Student’s” method, which is equivalent to saying
that Wilks’ formula is invalid, and that it is less convenient in application.
All of these statements are definitely incorrect, and it is the
purpose of this article to correct any false impressions they may have
created.

Wilks’ formula is in no sense a rediscovery of the method previously 
suggested by “Student,” but represents a distinctly new contribution
to the theory of mathematical statistics. The method suggested by 
Ezekiel has long been known to workers in the field of educational 
statistics. It was discarded by Wilks and the writer, however, because 
it is not valid in the situation for which Wilks’ formula is intended. 

In his previous article in this Journal⁵ and in a recent issue of
Metron⁶ Wilks presents a rigid mathematical proof that for a number
of samples of the same size, selected so as to show identical distribu- 
tions of a given variable x but otherwise selected strictly at random 





1 Ezekiel, Mordecai: “Student’s” Method for Measuring the Significance of a
Difference between Matched Groups. Journal of Educational Psychology, Vol.
XXIII, Sept., 1932, pp. 446-450.

2 Wilks, Samuel S.: The Standard Error of the Means of “Matched” Samples.
Journal of Educational Psychology, Vol. XXII, March, 1931, pp. 205-208.

3 Lindquist, E. F.: The Significance of a Difference between “Matched”
Groups. Journal of Educational Psychology, Vol. XXII, March, 1931, pp. 197-204.

4 “Student”: The Probable Error of a Mean. Biometrika, Vol. VI, 1908, pp.
1-25.

5 Wilks: Loc. cit.

6 Wilks, Samuel S.: On the Distributions of Statistics in Samples from a Normal
Population of Two Variables with Matched Sampling of One Variable. Metron,
Vol. IX, No. 3-4, 1932.
with respect to a related variable y, the standard deviation of y means
for these samples is described by the following formula:

$$\sigma_{\bar{y}} = \frac{\sigma_y}{\sqrt{N}} \sqrt{1 - r_{xy}^2} \qquad (1)$$

in which $\bar{y}$ represents the y-mean of a single sample, $\sigma_{\bar{y}}$ represents the
standard deviation of such means for a series of such samples, $\sigma_y$ represents
the standard deviation of the y-variables for all samples, N
represents the number of cases in the sample, and $r_{xy}$ represents the
correlation between x and y in the matched samples.

Now it follows that, if we have two sets of such samples with an
identical distribution of the x-measures for the samples in each set
but with a different correlation between variables x and y for the two
sets, the standard error of the difference between the y-means for a
pair of samples selected at random, one from each set, will be

$$\sigma_{d_{\bar{y}_1 - \bar{y}_2}} = \sqrt{\sigma_{\bar{y}_1}^2 + \sigma_{\bar{y}_2}^2} = \sqrt{\frac{\sigma_{y_1}^2}{N_1}\left(1 - r_{xy_1}^2\right) + \frac{\sigma_{y_2}^2}{N_2}\left(1 - r_{xy_2}^2\right)} \qquad (2)$$

in which $y_1$ represents a y-variable in the first sample of the pair and $y_2$
represents a y-variable in the second sample of the pair. $d_{\bar{y}_1 - \bar{y}_2}$ then
represents the difference between the y-means for the two samples, and
$\sigma_{d_{\bar{y}_1 - \bar{y}_2}}$ represents the standard error of this difference. The third
term in the usual formula for the standard error of the difference
between means drops out in this case, as there can obviously be no
correlation between the y-means of the paired samples constituting
a series of this kind, since each sample of each pair is selected at random
from its own set.

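In the case in which the correlation between the x and y variables
is the same in both sets of samples, the formula reduces to the simpler
form of

$$\sigma_{d_{\bar{y}_1 - \bar{y}_2}} = \sqrt{\left(\frac{\sigma_{y_1}^2}{N_1} + \frac{\sigma_{y_2}^2}{N_2}\right)\left(1 - r_{xy}^2\right)} \qquad (3)$$

which is equivalent to formula (9) in the earlier article by the writer.¹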
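
The formulas of this section amount to the ordinary standard error of a mean multiplied by the factor √(1 − r²), where r is the correlation between the matching variable and the measured variable. A minimal numerical sketch in Python follows; the standard deviations, sample sizes and correlations are hypothetical, and the function names are illustrative rather than taken from Wilks or the writer.

```python
from math import sqrt


def se_matched_mean(sd_y, n, r_xy):
    """Standard error of a y-mean for samples matched on x, as in formula (1)."""
    return sd_y / sqrt(n) * sqrt(1 - r_xy ** 2)


def se_matched_difference(sd_y1, n1, r_xy1, sd_y2, n2, r_xy2):
    """Standard error of the difference between two matched-group means, as in formula (2)."""
    return sqrt(se_matched_mean(sd_y1, n1, r_xy1) ** 2
                + se_matched_mean(sd_y2, n2, r_xy2) ** 2)


# Hypothetical groups: SD of 10, 100 pupils each, correlation of .6 with the matching variable.
matched = se_matched_difference(10, 100, 0.6, 10, 100, 0.6)
unmatched = se_matched_difference(10, 100, 0.0, 10, 100, 0.0)
print(round(matched, 3), round(unmatched, 3))
```

With a correlation of zero the expression reduces to the familiar standard error for independent random samples; any real correlation with the matching variable makes the matched standard error smaller.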

“Student’s” method, which, by the way, was not devised by “Student”
with this particular situation in mind, is applied by Ezekiel as
follows: Assuming a perfect item-to-item matching for a given pair of
matched samples (matched with respect to x), Ezekiel computes the














1 Lindquist: Loc. cit., p. 202. 
difference $d_{y_1 - y_2}$ in the y measures for each pair of matched individuals
and then determines the standard deviation $\sigma_{d_{y_1 - y_2}}$ of the
distribution of these differences for the two matched samples. Letting
$\bar{d}_{y_1 - y_2}$ represent the mean of these differences, he then determines the
standard error $\sigma_{\bar{d}_{y_1 - y_2}}$ of the mean difference by the following
formula:

$$\sigma_{\bar{d}_{y_1 - y_2}} = \frac{\sigma_{d_{y_1 - y_2}}}{\sqrt{N}} \qquad (4)$$


It is at this point that Ezekiel’s method breaks down. The rela- 
tionship between the standard error of the mean of a sample and the 
standard deviation of the population (for which the standard deviation 
of the sample is used as an approximation) which is expressed in 
formula (4) is one which holds only for strictly random samples. 
Matched samples, however, may not be considered as random in this 
sense. The standard error of the mean difference in the y measures 
is in general less than that given by (4) because of the restriction
imposed by making the distribution of the related x-variable identical
for all samples. The correct expression for the standard error of the
mean difference in y measures for paired individuals in the matched
groups, according to Wilks’ reasoning, is

$$\sigma_{\bar{d}_{y_1 - y_2}} = \frac{\sigma_{d_{y_1 - y_2}}}{\sqrt{N}} \sqrt{1 - r_{x d_{y_1 - y_2}}^2} \qquad (5)$$








This expression would reduce to (4) only in the case in which there is
no real correlation between the variable (x) used in matching and the
differences between y measures of matched individuals, i.e., only in
the case when $r_{x d_{y_1 - y_2}} = 0$. It is to be noted, however, that (4) can
be validly applied in such cases only after $r_{x d_{y_1 - y_2}}$ has been proven
equal to zero by direct computation, i.e., only by using (5) first. In
general, then, formula (4) is not valid, since in the usual methods of
experimentation there is a real correlation between the variable used
for matching and the differences in final achievement scores for matched
individuals. While formula (5) is valid and may be used, it is in general
less convenient than formulas (2) or (3). Formula (2) has the
significant advantage that it does not require perfect pupil-to-pupil
matching, but involves only the more expedient procedure of equating
groups on the basis of means and standard deviations. For this
reason, and also because it permits the use of matched groups of
unequal size, it avoids the wasteful elimination of original cases as
well as the laborious manipulations of data characteristic of perfect
pupil-to-pupil matching procedures.
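
The contrast between formulas (4) and (5) can be put numerically. In the Python sketch below the standard deviation of the pair differences, the number of pairs, and the correlation between the matching variable and the differences are all hypothetical; the function names are illustrative only.

```python
from math import sqrt


def se_mean_difference_random(sd_d, n):
    """Formula (4): standard error of the mean difference for strictly random samples."""
    return sd_d / sqrt(n)


def se_mean_difference_matched(sd_d, n, r_xd):
    """Formula (5): the same quantity reduced by the factor sqrt(1 - r^2),
    where r_xd is the correlation between the matching variable and the differences."""
    return sd_d / sqrt(n) * sqrt(1 - r_xd ** 2)


# Hypothetical case: 64 matched pairs, SD of differences 8, correlation .5.
by_formula_4 = se_mean_difference_random(8, 64)
by_formula_5 = se_mean_difference_matched(8, 64, 0.5)
print(by_formula_4, round(by_formula_5, 3))
```

The two expressions agree only when the correlation is zero; otherwise formula (4) overstates the standard error.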
The statements made by Ezekiel, then, are all almost unqualifiedly
incorrect. His article is of interest, nevertheless, as demonstrating the
dangers in reasoning from manufactured data, without any reference
to logical or theoretical considerations. The manufactured data which
he used, furthermore, are internally inconsistent, and are absurd in the
light of common experience. For example, his hypothecated data
show a correlation of .83 between IQ and achievement scores in arithmetic,
certainly a correlation of unusual magnitude for data of this
type. Worse, his hypothetical achievement scores show a correlation
between matched individuals (matched with respect to IQ) of .986!
This obviously could never have happened in actual practice, and is
furthermore directly inconsistent with the given correlation between
achievement and IQ. The lowest correlation between achievement
and IQ consistent with this correlation (of .986) between achievement
scores of matched individuals is .993! With this preposterous correlation
between achievement scores of matched pupils, he of course finds
a very small standard deviation in the differences between these scores,
and hence a small standard error of the mean difference by “Student’s”
method.
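
One reading that reproduces the figure of .993 is sketched below in Python. It assumes, beyond what the article states, that matched pupils share the same IQ and that their achievement scores are otherwise independent; the correlation between matched pupils' achievement scores then equals the square of the correlation between achievement and IQ. The function name is illustrative only.

```python
from math import sqrt


def implied_r_achievement_iq(r_between_matched_pupils):
    """Correlation of achievement with IQ implied by the correlation between
    matched pupils' scores, assuming r_matched = r_xy squared."""
    return sqrt(r_between_matched_pupils)


implied = implied_r_achievement_iq(0.986)
print(round(implied, 3))
```

On this reading a correlation of .83 between IQ and achievement would yield a correlation between matched pupils of only about .69, far short of the .986 shown by the manufactured data.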
COMMUNICATIONS AND DISCUSSIONS


THE BOGEY OF THE “PERSONAL EQUATION” IN RATINGS:
A REJOINDER 


The recent critical Discussion,¹ in this Journal, of my brief paper
on “The Bogey of the ‘Personal Equation’ in Ratings of Intelligence”²
calls for some comment. In point of fact, this brief paper did not at
all undertake to prove that the distrust of subjective ratings is unwarranted.
The aim of the paper was quite plainly and explicitly limited
to a demonstration (by means of correlations between ratings and an
objective criterion) of the comparative insignificance of a certain
factor which has sometimes been regarded as important. This factor,
the personal equation, is defined as “spurious differences between
judges in the means and standard deviations of their ratings” (p. 147).
Such a definition is in essential accord with the exact sense in which
the term, personal equation, has been used by astronomers and statisticians
in the past.

While I find myself in agreement with most of the remarks of the 
recent critical Discussion, I fail to see their relevance to the point of 
my brief paper. It is gratifying to observe that the proof of the 
comparative insignificance of the personal equation, as defined, has not 
yet been challenged. I may add that a second study, now in press,³
supplies experimental verification of the position taken in the first
article.

HERBERT S. CONRAD.

Institute of Child Welfare, University of California. 





1 Statistics Never Equivocate. Journal of Educational Psychology, Vol. XXIII, 
1932, pp. 465-466. 

2 Journal of Educational Psychology, Vol. XXIII, 1932, pp. 147-149. 

3 Journal of Social Psychology, November, 1932. 
BOOK REVIEWS


EDGAR A. DOLL, WINTHROP M. PHELPS and RUTH TAYLOR MELCHER.
Mental Deficiency Due to Birth Injuries. New York: The Macmillan
Co., 1932, Pp. XVI + 289.


Some decades ago when feeble-mindedness was regarded as a prob- 
lem for the physician rather than the psychologist, much emphasis was 
placed upon cerebral lesions, both natal and post-natal as causative 
factors in producing mental defects. Later on it was demonstrated 
that normal or even very superior mental development is not precluded 
by difficult labor or artificial delivery, that complete recovery with 
no observable mental sequelae is not impossible, even after skull
fracture or other severe head injuries. As a result of these findings 
the pendulum swung in the opposite direction. Brain injury as a 
possible cause of mental deficiency received but scant consideration. 
Indeed, the greater number of writers on the subject during the last 
quarter of a century seem to have devoted most of their time to an 
attempt to demonstrate its etiological insignificance rather than to 
describing the diagnostic symptoms by which it may be recognized. 

The authors of this book have therefore performed a timely 
service in calling attention to the fact that although difficult or pro- 
longed labor does not always result in mental deficiency, nevertheless 
serious mental and motor impairment does ensue in a sufficiently 
large proportion of cases to warrant much more emphasis upon birth 
injuries and their diagnosis than is given in most of the current text 
books. 

The present account is based upon an intensive study of twelve 
cases, nine male and three female, ranging in chronological age from 
four to thirty-nine years, in mental age from “about two”’ to thirteen 
years and in IQ (Stanford Binet) from thirty-one to ninety-four. 
The median IQ is sixty-seven. 

An introductory chapter of sixteen pages is devoted to a statement 
of the problem and discussion of its theoretical significance. This is 
followed by a general discussion of birth injuries and their physical 
signs as reported in the literature, with estimates by various authori- 
ties as to frequency in the general population and among the feeble- 
minded. The next two chapters are devoted to a description of the 
subjects, including the results of the neurological examination, 
birth and developmental histories. 
The greater part of the book is given up to a very detailed presentation
of mental test results. It is shown that the Stanford-Binet
is applicable to birth-injured children without significant
modification, but that tests in which the element of time is important, 
form-board tests, or those requiring much use of pencil and paper 
impose an undue handicap because of paralytic conditions and tremors 
that affect the use of the hands. Repeated administrations of the 
Binet tests suggest that the period of mental growth may be somewhat 
longer in birth-injured subjects than among the feebleminded in 
general; a conclusion which, if substantiated by further investigation, 
would lead to more favorable prognosis in these cases. However, 
in view of the small number of cases and the variation from case to 
case, relatively little confidence in the finding is warranted at present. 

A chapter on physical therapy describes the methods used for the 
individual cases, but draws no conclusions as to results. 

In view of the very marked motor disturbances shown by these 
subjects it is rather surprising to find that no tests that are frankly 
designed to measure motor abilities are reported as having been tried. 
One would like a more concrete picture of the extent of the motor 
handicaps present, particularly when a total of approximately fifty 
pages is devoted to a rather elaborate demonstration (involving 
correlation coefficients, mental ages, percentiles, graphs, and other 
statistical devices) designed to show that seven of the nine tests 
for which results are given were unsuitable for these subjects by reason 
of the motor handicap. It is also to be regretted that nowhere in the 
book is there to be found an adequate account of the everyday behavior 
and personality of these children outside the psychological laboratory. 
We are not told what kind of work they can do, whether or not there 
seem to be personality disturbances other than the mental defects, 
how they amuse themselves, how they get on with their fellows. I 
am inclined to feel that a study of factors such as these would have 
been more worth while than the very detailed analysis of the mental 
test results. 

The chief value of the study should come in the future when the 
results of autopsies can be compared with the developmental records. 
Meanwhile Dr. Doll and his colleagues have performed an extremely 
valuable service in calling attention to the importance of birth injuries 
as causative factors in the production of mental defect. If their 
estimate that from five to ten per cent of the population of institutions 
for the feeble-minded are suffering from birth injuries is correct, it 
seems evident that far too little emphasis has been placed upon this 
condition in current discussion and treatment of mental deficiency. 
FLORENCE L. GOODENOUGH. 
Institute of Child Welfare, 
University of Minnesota. 


CALVIN P. STONE, CHESTER W. DARROW, CARNEY LANDIS, and 
LENA L. HEATH. Studies in the Dynamics of Behavior. Chicago: 
The University of Chicago Press, 1932, Pp. XIV + 332. 


K. S. Lashley, who has edited “Studies in the Dynamics of Behavior,” 
terms the researches therein reported “an adventure in methodology.” 
As such they are exploratory, and offer only tentative 
conclusions. 

The book contains three reports: the first, by Calvin P. Stone, 
on “Wildness and Savageness in Rats”; the second, by Chester W. 
Darrow and Lena L. Heath, on “Reaction Tendencies Relating to 
Personality”; the third, by Carney Landis, on “An Attempt to Measure 
Emotional Traits in Juvenile Delinquency.” 

Dr. Stone, dealing with the problem of the effect of training upon 
inherited temperamental differences in rats, presents and analyzes 
his data with his customary clearness and simplicity. He arrives 
at a number of specific conclusions, the most important being that 
differences in the traits of wildness and savageness result from heredi- 
tary rather than environmental factors. Darrow and Heath have 
made a rather elaborate study of the relationships between certain 
physiological reactions and personality traits in an attempt to discover 
the organic basis of differences in emotionality and other tempera- 
mental variables. Their report is rather disappointing, perhaps 
because of the lack of positive conclusions. “In no instance,” 
they write, “have we demonstrated that temperamental type, as 
defined by the questionnaires used, is a result of the physiological 
traits measured.” The joker may lie in “the questionnaires used,” 
the validity of even the best of personality tests being open to question. 
In the third study, Dr. Landis investigates the question of whether 
emotional stability is a significant factor in juvenile delinquency. 
He was unable to proceed beyond an evaluation of available measuring 
instruments, for he found that “the present experimental test methods 
are not adapted to answer either positively or negatively the questions 
concerning personality traits.” Deprived of the means of making 
an objective approach, he was forced to leave the problem very 
much as he found it. 

“Studies in the Dynamics of Behavior” adds little to the minute 
amount already known concerning the sources of human conduct, 
but it is intelligent groping in the dark. Probably a great deal of 
such groping in the field of personality will have to be done before 
even a few important facts are hit upon. 
HERBERT A. CARROLL. 
University of Minnesota. 


PAUL S. ACHILLES, Editor. Psychology at Work. Pp. 260. New 
York: McGraw-Hill Book Company, Inc., 1932. 


“Psychology at Work” is an answer to a need that every psycholo- 
gist has felt—a book that he may offer to his lay friends when they 
ask ‘‘What is psychology all about, anyway?” 

The book is a clearly and popularly written symposium on seven 
important applications of psychology, each section composed by a lead- 
ing authority in the field. It is a welcome relief from the general 
run of so-called popular psychology books, for it is authoritative 
without losing the qualities of interest and comprehensibility for the 
layman. 

The first two chapters on “Psychology and the Pre-school Child” 
and “The Study and Guidance of Infant Behavior” are by Lois Hayden 
Meek and Arnold Gesell, respectively. In the section “Psychology 
and Education” Arthur I. Gates devotes most of his space to modern 
diagnostic and remedial procedures in the skill subjects. “The 
Foundations of Personality” is by Mark A. May and a chapter on 
“Psychology and the Professions, Medicine, Law, and Theology” is 
excellently handled by Walter R. Miles. 

“Psychology and Industry” by Morris S. Viteles summarizes the 
studies on job selection and adjustment and on accident prevention. 
Floyd H. Allport contributes a section on “Psychology in Relation to 
Social and Political Problems” which presents an objective point of 
view typical of the best in modern social psychology. 

Although “Psychology at Work” is intended for non-professional 
use, it will not be out of place in the class room. It will provide 
interesting supplementary material for courses in Elementary or 
Applied Psychology. 
LAURANCE F. SHAFFER. 
Carnegie Institute of Technology. 


R. R. RUSK. Research in Education: An Introduction. Pp. 108. 
London: University of London Press, 1932. 


In this short monograph, which is an extension of a lecture delivered 
to Section L of the British Association in 1928, Dr. Rusk makes a 
welcomed return to an early love. Twenty years ago he published 
an “Introduction to Experimental Education,” a capital book born 
somewhat prematurely. The new venture gives a rapid survey of the 
field of research in education, which he admits will be of greater value 
in Great Britain than in the United States where such things are taken 
more as a matter of course. He repeats Myers’ warning about the 
necessity for securing accurate primary data, a warning that may 
usefully be taken to heart by American educators. 
PETER SANDIFORD. 
University of Toronto. 


ADOLPH W. OTTERSTEIN and RAYMOND M. MOSHER. O-M Sight-singing 
Test. Stanford University: Stanford University Press, 
1932. 


This test was devised by Adolph Otterstein and Raymond Mosher of 
San Jose State Teachers College, and is intended to measure the music- 
reading achievement of college students. It consists of a series of 
melodies of various degrees of difficulty. The very high coefficients of 
reliability found by the authors probably result from the compensatory 
nature of errors in the test as a whole. The scoring will vary considerably 
with the examiner. There are several defects which may further 
invalidate results: The playing of a single triad, which does not estab- 
lish a tonality; the repetition of melodies which is no longer sight-read- 
ing but memory; the confusion of rhythm with metre, pitch with tone, 
use of notes of various values without indicating tempo; carelessness in 
calling a 2/4 metre 4/4, avoidance of dotted rhythms in the earlier examples, 
and finally the equalization of all errors regardless of degree, absolute 
or relative pitch significance, or musical meaning. As a result the 
test can serve only as a very general index of sight-reading ability and 
that only in terms of the examiner giving the test. 

OTTO ORTMANN. 
Peabody Conservatory of Music, Baltimore. 


WILLIAM H. BURNHAM. The Wholesome Personality, a Contribution to 
Mental Hygiene. Pp. XIV + 713. New York: D. Appleton & 
Co., 1932. 


With the increasing consideration given to mental hygiene as a 
necessity in the administration of schools, it may not be impertinent 
for a schoolmaster—a novice in mental hygiene—to express his 
opinion of the usefulness of Professor Burnham’s text. In fact, a 
teacher cannot read far in this book before feeling quite at home 
for much of it reads like a philosophy of education. It may not be 
short of astonishing to many to find so much that is synonymous in the 
two fields. A few quotations will demonstrate this assertion. For 
instance: “The great means of preserving and developing the wholesome 
personality is attentive activity, physical and mental . . . 
Children who are let alone in the early years to make their own 
contacts with the world, do learn something of the laws of possibility 
and of the inexorable facts of nature, and thus are saved from much 
conflict . . . Where a more powerful remedy is necessary (to remedy 
those survivals from childhood which are injurious to mental health), 
nothing is so valuable as devotion to an all-absorbing task . . . The 
teacher’s function is to give every child the opportunity for a fitting 
task; and it is the business of the teacher not only to perform this 
function every day, but in some way, at some time, in some subject, 
to give every child the stimulus of a distinct success . . . Pedagogy 
tries to stimulate the growth of personality by teacher’s instructions 
and discipline, hygiene by the child’s learning and self-direction . . . 
Hygiene points out that only by successful achievement comes the 
attitude of confidence essential for success, both in the school and in 
society . . . The wholesome personality itself can be understood only 
from the genetic point of view. Although the old view of human per- 
sonality as substantially the same at all periods of life has been dis- 
carded, nevertheless the tendency to treat children as if they could 
acquire and should practice adult virtues and ability at an early age, 
still persists.” Those seeking a better understanding of the aims of 
the progressive education movement will find much in this book to 
clarify their thinking. 

It will go far to orient the careful reader concerning the qualities 
of the wholesome personality (one cannot do better than use the 
author’s title). It is a presentation of conditions necessary to the 
prevention of personality disorders. In the preface the purpose is 
stated as “an attempt to present the scientific conception of the 
normal integrated personality, the conditions that seem favorable to 
its wholesome development, and also some of the conditions likely to 
produce personality disorders.” It is an extension of a former work, 
The Normal Mind, with broader conception and wider application 
due to the studies made in the intervening seven years. Twelve factors 
of personality form the outline which serves to emphasize some facts 
important for mental health. 

Extensive bibliographies follow each chapter. They are preceded 
in each case by an adequate summary of the chapter. Two supple- 
mentary book lists are given at the end of the final chapter. 

By constant reiteration and explanation the concept of the inte- 
grated personality is made clear. It is a repetition not at all monoto- 
nous but necessary. Dr. Burnham writes with a clear, readable style 
which will contribute to make this book as popular as its predecessor. 
With due humility born of an appreciation of the present limits of 
information in the field, he offers many suggestive procedures. 

Physically, the book is attractive though its binding does not 
appear as strong as a book of this size requires for hard usage. 

The chapter on the Problem of Failure is interesting reading that 
will inspire sober thought. It epitomizes much to be found in the 
entire book. To my mind it serves to keep a stern problem before 
those charged with the future of education. We are living in a time 
hardly equaled in the history of our civilization for its insecurity. 
Consolidations in industry and commerce have come to make work 
more impersonal. Labor-saving machinery is being invented so rapidly 
that few men know how long they will have a job. Internationally the 
world seems less secure than ever before. Fears beset us. There is 
reasonable fear for democratic governments and with it, democratic 
procedures in all group living. Yet failure and fear are two of the 
greatest contributors to personality maladjustments while the normal 
democratic group and the purposive, worthwhile task for child and 
adult are considered two minimal essentials of the hygiene of person- 
ality. This being true, the need seems apparent to strive with renewed 
effort to determine what the school must do to attain and assure 
adequate preservation and development of wholesome personalities. 

J. H. COLEMAN. 
Huntington, N. Y. 


TRUMAN LEE KELLEY. Scientific Method: Its Function in Research 
and in Education. New York: The Macmillan Co., 1932, 
Pp. X + 233. 


This stimulating volume by one who has contributed greatly to 
the creation of a science of education is essentially a collection of 
essays on the place and value of the scientific method in education. 
The book is not a unified treatise as one would find in a textbook 
for a course designed to train students for thesis writing and graduate 
research, but rather a loosely articulated set of lectures and articles 
printed in a single volume, all dealing more or less with the same general 
topic. 

Not the least important part of the book is that which treats 
the relative contributions of the scientist and of the philosopher 
to the advancement of truth in the field of education. Also, the 
discussion of objectivity of measurement and that of the measurement 
of social outcomes of education are of exceptional interest and value. 
The last chapter, on Mental Traits of Men of Science, tells us much 
that is distinctly helpful and constructive by describing the principles 
and methods of work employed by a large number of the outstanding 
scientists. Perhaps one of the best ways to develop new scientists 
is to study the scientists who have already distinguished themselves 
and to search for the common traits or characteristics which they 
possessed. The book ends with a list of traits which the author says 
are invariably possessed by scientists of note, and other lists of traits 
which are likely to be present. It would be well if some one could 
tell us how to develop more of these traits in the graduate students 
whom we are grooming for the scientific leadership of the next 
generation. 
C. C. CRAWFORD. 
University of Southern California. 